[PR #23874] [CLOSED] fix: extract text from PDF URLs in fetch_url tool #66276

GiteaMirror · 2026-05-06T12:32:43-05:00

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/23874
Author: @gaurav0107
Created: 4/19/2026
Status: ❌ Closed

Base: dev ← Head: fix/fetch-url-pdf-handling

📝 Commits (1)

9d3d80b fix: extract text from PDF URLs in fetch_url tool

📊 Changes

3 files changed (+485 additions, -2 deletions)

📝 backend/open_webui/retrieval/utils.py (+109 -2)
📝 backend/open_webui/retrieval/web/utils.py (+33 -0)
➕ backend/tests/retrieval/test_pdf_handling.py (+343 -0)

📄 Description

Pull Request Checklist

Target branch: Verify that the pull request targets the dev branch.
Description: Provided below.
Changelog: Included below.
Dependencies: No new dependencies. pypdf is already pinned in requirements.txt (pypdf==6.7.5).
Testing: 12 unit tests included. Manual verification performed.
Agentic AI Code: This PR has gone through human review and manual testing.
Code review: Self-reviewed. Follows existing project patterns.
Git Hygiene: Atomic PR — one logical change, rebased on dev.

Changelog Entry

Description

When fetch_url fetches a PDF URL, the content is returned as garbled binary text because BeautifulSoup's HTML parser corrupts the raw PDF bytes. This PR adds PDF-aware text extraction so that PDF URLs return clean, readable text.

Closes #23841

Added

extract_text_from_pdf_bytes() — extracts text from raw PDF bytes using pypdf
extract_pdf_from_url() — downloads a PDF and extracts text with SSRF protection, streaming size limits, and proper session management
Three-layer PDF detection in the web retrieval pipeline:
- Fast-path for URLs ending in .pdf (case-insensitive)
- Content-Type: application/pdf header detection in async SafeWebBaseLoader._fetch()
- %PDF binary content fallback in get_content_from_url()
12 unit tests in backend/tests/retrieval/test_pdf_handling.py
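
The three detection layers listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper names (url_looks_like_pdf, content_type_is_pdf, body_is_pdf, is_pdf) are hypothetical, and matching the suffix against the URL path only (ignoring any query string) is an assumption the description does not confirm.

```python
from typing import Optional
from urllib.parse import urlparse


def url_looks_like_pdf(url: str) -> bool:
    """Layer 1: fast path for URLs whose path ends in .pdf (case-insensitive).

    Assumption: only the path component is checked, so a query string
    such as ?download=1 does not defeat the match.
    """
    return urlparse(url).path.lower().endswith(".pdf")


def content_type_is_pdf(content_type: Optional[str]) -> bool:
    """Layer 2: Content-Type header says application/pdf.

    Parameters after ';' (for example a charset) are ignored.
    """
    if not content_type:
        return False
    return content_type.split(";", 1)[0].strip().lower() == "application/pdf"


def body_is_pdf(body: bytes) -> bool:
    """Layer 3: fallback on the %PDF magic bytes at the start of the body."""
    return body.startswith(b"%PDF")


def is_pdf(url: str, content_type: Optional[str] = None, body: bytes = b"") -> bool:
    """Combine the three layers; any single positive signal is enough."""
    return (
        url_looks_like_pdf(url)
        or content_type_is_pdf(content_type)
        or body_is_pdf(body)
    )
```

The cheap URL check runs first so that obvious PDF links never reach the HTML parser; the content sniff catches download endpoints that neither end in .pdf nor send an accurate header.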

Changed

get_content_from_url() now detects and properly handles PDF content instead of piping binary through BeautifulSoup
Fixed return type annotation: get_content_from_url() -> tuple[str, list[Document]]

Deprecated

N/A

Removed

N/A

Fixed

fetch_url tool now returns extracted text instead of garbled binary when fetching PDF URLs (#23841)
Image-only PDFs return a clear placeholder message instead of empty/corrupted content
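
For illustration, the extraction step and the image-only fallback might look like the sketch below. It assumes pypdf's PdfReader and page.extract_text(); the placeholder wording and the join_page_texts helper are hypothetical, and the body of extract_text_from_pdf_bytes() here is a guess at the general shape rather than the PR's implementation.

```python
from io import BytesIO
from typing import Iterable, Optional

# Hypothetical wording; the PR only says a "clear placeholder message" is returned.
NO_TEXT_PLACEHOLDER = "[PDF contains no extractable text; it may be image-only.]"


def join_page_texts(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, falling back to a placeholder when every page is empty."""
    parts = [text.strip() for text in page_texts if text and text.strip()]
    return "\n\n".join(parts) if parts else NO_TEXT_PLACEHOLDER


def extract_text_from_pdf_bytes(data: bytes) -> str:
    """Extract text from raw PDF bytes using pypdf.

    pypdf is imported lazily so the helper above stays usable without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(BytesIO(data))
    return join_page_texts(page.extract_text() for page in reader.pages)
```

A scanned document typically yields empty strings for every page, so the caller gets the placeholder rather than an empty result.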

Security

SSRF protection via validate_url() before any PDF download
Streaming download (stream=True) with Content-Length header pre-check to reject oversized PDFs before buffering
50 MB size limit enforced on both header and actual body size
requests.Session used as context manager to prevent connection leaks
Reuses existing SSL-verification and proxy settings from app config
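
The two-stage size check described above (header first, then the actual body) can be sketched independently of the HTTP client. The names read_with_limit and PdfTooLargeError are hypothetical; only the 50 MB figure comes from this PR description.

```python
from typing import Iterable, Optional

MAX_PDF_BYTES = 50 * 1024 * 1024  # 50 MB, the limit stated in the PR description


class PdfTooLargeError(ValueError):
    """Raised when a PDF exceeds the configured size limit (hypothetical name)."""


def read_with_limit(
    chunks: Iterable[bytes],
    content_length: Optional[str] = None,
    max_bytes: int = MAX_PDF_BYTES,
) -> bytes:
    """Buffer a streamed body while enforcing a size limit twice.

    1. Pre-check: reject early if the declared Content-Length is too large.
    2. Body check: count bytes while streaming, because the header can be
       missing, malformed, or simply wrong.
    """
    if content_length is not None:
        try:
            declared = int(content_length)
        except ValueError:
            declared = None  # malformed header: rely on the body check
        if declared is not None and declared > max_bytes:
            raise PdfTooLargeError(f"declared size {declared} exceeds {max_bytes}")

    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) > max_bytes:
            raise PdfTooLargeError(f"body exceeds {max_bytes} bytes")
    return bytes(buffer)
```

With requests, the chunks would plausibly come from response.iter_content(chunk_size=...) on a request made with stream=True, and content_length from response.headers.get("Content-Length"), after validate_url() has accepted the URL.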

Breaking Changes

None

Additional Information

pypdf is already a project dependency (requirements.txt: pypdf==6.7.5), used by the existing PyPDFLoader
The fix covers both sync (get_content_from_url → loader.load()) and async (SafeWebBaseLoader._fetch()) code paths
Run tests: PYTHONPATH=backend pytest backend/tests/retrieval/test_pdf_handling.py -v

Screenshots or Videos

N/A — backend-only change, no UI impact.

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.

Note

Deleting the CLA section will lead to immediate closure of your PR and it will not be merged in.

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-06 12:32:43 -05:00

GiteaMirror closed this issue

2026-05-06 12:32:46 -05:00


Reference: github-starred/open-webui#66276