[PR #23869] [CLOSED] fix: extract text from PDF URLs in fetch_url tool #50465

Closed
opened 2026-04-30 03:11:08 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/23869
Author: @gaurav0107
Created: 4/18/2026
Status: Closed

Base: main ← Head: fix/fetch-url-pdf-handling


📝 Commits (1)

  • 1c22f06 fix: extract text from PDF URLs in fetch_url tool

📊 Changes

3 files changed (+440 additions, -2 deletions)


📝 backend/open_webui/retrieval/utils.py (+101 -2)
📝 backend/open_webui/retrieval/web/utils.py (+26 -0)
➕ backend/tests/retrieval/test_pdf_handling.py (+313 -0)

📄 Description

Summary

Fixes #23841. When fetch_url fetches a PDF URL, the content is returned as garbled binary text because the response body is passed through BeautifulSoup's HTML parser, which corrupts the raw PDF bytes.

This PR adds PDF-aware text extraction with three detection layers:

  • Fast-path: URLs ending in .pdf (case-insensitive) bypass the web loader entirely and extract text directly with pypdf
  • Content-Type detection: The async SafeWebBaseLoader._fetch() checks the Content-Type: application/pdf header for dynamic download URLs
  • Fallback: If the web loader returns content starting with %PDF, the URL is re-downloaded and the text is extracted with pypdf
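
The three layers above can be sketched as three small predicates. This is a minimal illustration, not the code from the PR: the function names are hypothetical, and ignoring the query string when checking the .pdf suffix is an assumption the description does not confirm.

```python
from typing import Optional
from urllib.parse import urlparse

PDF_MAGIC = "%PDF"  # every PDF file begins with this marker


def url_looks_like_pdf(url: str) -> bool:
    """Layer 1 (fast path): the URL path ends in .pdf, case-insensitively.

    Assumption: only the path is inspected, so a query string such as
    ?download=1 does not hide the extension.
    """
    return urlparse(url).path.lower().endswith(".pdf")


def content_type_is_pdf(content_type: Optional[str]) -> bool:
    """Layer 2: the Content-Type header names application/pdf.

    Parameters after ';' (for example a charset) are ignored.
    """
    if not content_type:
        return False
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/pdf"


def text_looks_like_pdf(text: str) -> bool:
    """Layer 3 (fallback): loader output that still starts with the PDF marker."""
    return text.startswith(PDF_MAGIC)
```

Ordering the checks this way means the cheapest test (a string comparison on the URL) runs first, and the fallback only triggers when the first two layers miss.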

Security & resource controls

  • SSRF protection via validate_url() before any download
  • Streaming download (stream=True) with Content-Length header pre-check to reject oversized PDFs before buffering
  • 50 MB size limit enforced on both header and actual body
  • requests.Session used as context manager (no connection leaks)
  • Reuses existing SSL-verification and proxy settings
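
A rough sketch of the size controls listed above, assuming the requests library. The names read_capped, download_pdf, and PdfTooLargeError are hypothetical, and the SSRF check (validate_url() in the PR) plus the SSL and proxy settings are left out because their signatures are not shown in the description.

```python
from typing import Iterable, Optional

MAX_PDF_BYTES = 50 * 1024 * 1024  # 50 MB, the limit named in the description


class PdfTooLargeError(ValueError):
    """Raised when a PDF exceeds the size limit (hypothetical name)."""


def read_capped(
    chunks: Iterable[bytes],
    declared_length: Optional[str],
    max_bytes: int = MAX_PDF_BYTES,
) -> bytes:
    """Buffer streamed chunks, enforcing the limit on header and body."""
    # Pre-check: reject on the declared size before buffering anything.
    if declared_length and declared_length.isdigit():
        if int(declared_length) > max_bytes:
            raise PdfTooLargeError(f"declared size {declared_length} exceeds limit")

    # Body check: Content-Length can be missing or wrong, so count real bytes.
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) > max_bytes:
            raise PdfTooLargeError("body exceeds limit")
    return bytes(buffer)


def download_pdf(url: str, timeout: float = 30.0) -> bytes:
    """Stream a URL into memory under the size cap.

    The caller is assumed to have validated the URL against SSRF first.
    """
    import requests  # imported lazily so read_capped works without it

    # Both context managers release the connection even if an error is raised.
    with requests.Session() as session:
        with session.get(url, stream=True, timeout=timeout) as response:
            response.raise_for_status()
            return read_capped(
                response.iter_content(chunk_size=64 * 1024),
                response.headers.get("Content-Length"),
            )
```

Checking both the header and the running byte count matters because a server can omit Content-Length or report a smaller value than it actually sends.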

No new dependencies

pypdf is already pinned in requirements.txt (pypdf==6.7.5) and used by the existing PyPDFLoader.
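
For orientation, text extraction with pypdf might look roughly like the sketch below. The helper names and the placeholder wording are hypothetical; the PR's actual message for blank or image-only PDFs is not quoted in the description.

```python
import io
from typing import Iterable, Optional

# Hypothetical wording; the PR's real placeholder text is not shown.
NO_TEXT_PLACEHOLDER = "[PDF contained no extractable text]"


def join_page_texts(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, returning a placeholder when every page is empty."""
    parts = [text.strip() for text in page_texts if text and text.strip()]
    return "\n\n".join(parts) if parts else NO_TEXT_PLACEHOLDER


def extract_pdf_text(data: bytes) -> str:
    """Extract text from in-memory PDF bytes with pypdf.

    Corrupted or truncated input makes PdfReader raise; the exception is
    left to propagate so the caller can decide on a fallback message.
    """
    from pypdf import PdfReader  # imported lazily; already a project dependency

    reader = PdfReader(io.BytesIO(data))
    return join_page_texts(page.extract_text() for page in reader.pages)
```

Image-only PDFs yield empty strings from extract_text(), which is why the joining step, rather than the reader, decides when to return the placeholder.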

Test plan

  • 12 unit tests covering all new/modified functions
  • Valid PDF text extraction
  • Blank/image-only PDFs return placeholder message
  • Corrupted/truncated PDFs raise exceptions
  • Oversized PDF rejection (both Content-Length and body size)
  • .pdf URL fast-path routing (case-insensitive)
  • %PDF binary content fallback detection
  • Error fallback returns user-friendly message
  • Non-PDF URLs work unchanged

Run tests: PYTHONPATH=backend pytest backend/tests/retrieval/test_pdf_handling.py -v

🤖 Generated with Claude Code


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-30 03:11:08 -05:00
Reference: github-starred/open-webui#50465