[PR #23874] [CLOSED] fix: extract text from PDF URLs in fetch_url tool #66276

opened 2026-05-06 12:32:43 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/23874
Author: @gaurav0107
Created: 4/19/2026
Status: Closed

Base: dev ← Head: fix/fetch-url-pdf-handling


📝 Commits (1)

  • 9d3d80b fix: extract text from PDF URLs in fetch_url tool

📊 Changes

3 files changed (+485 additions, -2 deletions)


📝 backend/open_webui/retrieval/utils.py (+109 -2)
📝 backend/open_webui/retrieval/web/utils.py (+33 -0)
➕ backend/tests/retrieval/test_pdf_handling.py (+343 -0)

📄 Description

Pull Request Checklist

  • Target branch: Verify that the pull request targets the dev branch.
  • Description: Provided below.
  • Changelog: Included below.
  • Dependencies: No new dependencies. pypdf is already pinned in requirements.txt (pypdf==6.7.5).
  • Testing: 12 unit tests included. Manual verification performed.
  • Agentic AI Code: This PR has gone through human review and manual testing.
  • Code review: Self-reviewed. Follows existing project patterns.
  • Git Hygiene: Atomic PR — one logical change, rebased on dev.

Changelog Entry

Description

When fetch_url fetches a PDF URL, the content is returned as garbled binary text because BeautifulSoup's HTML parser corrupts the raw PDF bytes. This PR adds PDF-aware text extraction so that PDF URLs return clean, readable text.

Closes #23841

Added

  • extract_text_from_pdf_bytes() — extracts text from raw PDF bytes using pypdf
  • extract_pdf_from_url() — downloads a PDF and extracts text with SSRF protection, streaming size limits, and proper session management
  • Three-layer PDF detection in the web retrieval pipeline:
    • Fast-path for URLs ending in .pdf (case-insensitive)
    • Content-Type: application/pdf header detection in async SafeWebBaseLoader._fetch()
    • %PDF binary content fallback in get_content_from_url()
  • 12 unit tests in backend/tests/retrieval/test_pdf_handling.py
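The three detection layers listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper names (`url_has_pdf_suffix`, `header_says_pdf`, `body_starts_with_pdf_magic`, `looks_like_pdf`) are hypothetical, and checking only the URL path (ignoring the query string) is an assumption.

```python
from typing import Optional
from urllib.parse import urlparse

PDF_MAGIC = b"%PDF"  # signature at the start of a PDF file


def url_has_pdf_suffix(url: str) -> bool:
    """Layer 1: the URL path ends in .pdf (case-insensitive)."""
    # Assumption: only the path is inspected, so "?download=1" is ignored.
    return urlparse(url).path.lower().endswith(".pdf")


def header_says_pdf(content_type: Optional[str]) -> bool:
    """Layer 2: the Content-Type header names application/pdf."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=binary" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/pdf"


def body_starts_with_pdf_magic(body: bytes) -> bool:
    """Layer 3: the payload itself begins with the %PDF signature."""
    return body.startswith(PDF_MAGIC)


def looks_like_pdf(url: str, content_type: Optional[str] = None, body: bytes = b"") -> bool:
    """Return True if any of the three layers identifies the response as a PDF."""
    return (
        url_has_pdf_suffix(url)
        or header_says_pdf(content_type)
        or body_starts_with_pdf_magic(body)
    )
```

Ordering the checks from cheapest to most expensive means the body only needs to be inspected when neither the URL nor the header gives the answer.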

Changed

  • get_content_from_url() now detects and properly handles PDF content instead of piping binary through BeautifulSoup
  • Fixed return type annotation: get_content_from_url() -> tuple[str, list[Document]]

Deprecated

  • N/A

Removed

  • N/A

Fixed

  • fetch_url tool now returns extracted text instead of garbled binary when fetching PDF URLs (#23841)
  • Image-only PDFs return a clear placeholder message instead of empty/corrupted content
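A rough sketch of how text extraction with an image-only fallback might look. The function names and the placeholder wording here are hypothetical (the PR description does not show the actual message); only `pypdf.PdfReader`, `reader.pages`, and `page.extract_text()` are real pypdf calls.

```python
import io
from typing import Iterable, Optional

# Hypothetical placeholder text; the PR's actual wording is not shown in its description.
NO_TEXT_PLACEHOLDER = "[PDF contains no extractable text (possibly image-only)]"


def join_page_texts(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, falling back to a placeholder when every page is empty."""
    cleaned = [text.strip() for text in page_texts if text and text.strip()]
    return "\n\n".join(cleaned) if cleaned else NO_TEXT_PLACEHOLDER


def extract_text_sketch(pdf_bytes: bytes) -> str:
    """Extract text from raw PDF bytes with pypdf."""
    # Imported lazily so the pure helper above can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(pdf_bytes))
    return join_page_texts(page.extract_text() for page in reader.pages)
```

Keeping the join-and-fallback logic separate from the pypdf call makes the image-only case easy to unit test without building a real PDF.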

Security

  • SSRF protection via validate_url() before any PDF download
  • Streaming download (stream=True) with Content-Length header pre-check to reject oversized PDFs before buffering
  • 50 MB size limit enforced on both header and actual body size
  • requests.Session used as context manager to prevent connection leaks
  • Reuses existing SSL-verification and proxy settings from app config
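The two-stage size check described above (header pre-check, then enforcement on the actual body) might be sketched as below. `read_capped` and `PdfTooLargeError` are hypothetical names, and the function works on any iterable of byte chunks so that it stays independent of the HTTP client.

```python
from typing import Iterable, Optional

MAX_PDF_BYTES = 50 * 1024 * 1024  # 50 MB, the limit stated in the PR


class PdfTooLargeError(ValueError):
    """Hypothetical error raised when a PDF exceeds the size limit."""


def read_capped(
    chunks: Iterable[bytes],
    content_length: Optional[str] = None,
    max_bytes: int = MAX_PDF_BYTES,
) -> bytes:
    """Buffer streamed chunks, rejecting the download once it exceeds max_bytes."""
    # Stage 1: use the Content-Length header only to reject early, before buffering.
    if content_length is not None and content_length.strip().isdigit():
        if int(content_length.strip()) > max_bytes:
            raise PdfTooLargeError(f"declared size exceeds {max_bytes} bytes")

    # Stage 2: the header may be missing or wrong, so count the real bytes too.
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) > max_bytes:
            raise PdfTooLargeError(f"body exceeds {max_bytes} bytes")
    return bytes(buffer)
```

With `requests`, the chunks could come from `response.iter_content(chunk_size=8192)` on a response opened with `stream=True`, and the header from `response.headers.get("Content-Length")`.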

Breaking Changes

  • None

Additional Information

  • pypdf is already a project dependency (requirements.txt: pypdf==6.7.5), used by the existing PyPDFLoader
  • The fix covers both sync (get_content_from_url → loader.load()) and async (SafeWebBaseLoader._fetch()) code paths
  • Run tests: PYTHONPATH=backend pytest backend/tests/retrieval/test_pdf_handling.py -v

Screenshots or Videos

N/A — backend-only change, no UI impact.

Contributor License Agreement

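  • By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.

Note: Deleting the CLA section will lead to immediate closure of your PR and it will not be merged in.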


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-06 12:32:43 -05:00
Reference: github-starred/open-webui#66276