[GH-ISSUE #23841] issue: fetch_url tool call returns PDF files in binary, overwhelming most models #35614

Closed
opened 2026-04-25 09:47:13 -05:00 by GiteaMirror · 1 comment

Originally created by @Davimalu on GitHub (Apr 17, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/23841

Check Existing Issues

  • I have searched for any existing and/or related issues.
  • I have searched for any existing and/or related discussions.
  • I have also searched in the CLOSED issues AND CLOSED discussions and found no related items (your issue might already be addressed on the development branch!).
  • I am using the latest version of Open WebUI.

Installation Method

Docker

Open WebUI Version

v0.8.12

Ollama Version (if applicable)

No response

Operating System

Debian 13

Browser (if applicable)

Firefox 149.0.2

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided every relevant configuration, setting, and environment variable used in my setup.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
  • Start with the initial platform/version/OS and dependencies used,
  • Specify exact install/launch/configure commands,
  • List URLs visited, user input (incl. example values/emails/passwords if needed),
  • Describe all options and toggles enabled or changed,
  • Include any files or environmental changes,
  • Identify the expected and actual result at each stage,
  • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

The fetch_url method should detect that a PDF is being downloaded and pass the extracted text to the model instead of the raw binary data.

Actual Behavior

When a model calls fetch_url on a .pdf document, or on a website that redirects to a .pdf document, it writes the document's raw binary content into the model's context window, e.g.:

%PDF-1.5
%����
1 0 obj
<<
/Length 843       
/Filter /FlateDecode
>>
stream
x�mUMo�0��Wx���N�W����H��
Z�&��T���~3ڮ�z��y�87?�����n�k��N�ehܤ��=77U�\�;?:׺v�==��o��n�U����;O^u���u#���½��O
��ۍ�=٘�a�?���kLy�6F��/7��}��̽���][�H<Si��c�ݾk�^�90�j��YV����H^����v}0����<���rL���
��ͯ�_�/��Ck���B�n��y���W������THk����u��qö{s�\녚��"p]�Ϟќ��K�յ�u�/��A�	)`JbD>`���2���$`�TY'`�(Zq����BJŌ

This completely overwhelms most models, and they stop their response.
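
A PDF URL like the one above is normally served with a Content-Type: application/pdf header, so the tool has enough information to branch before it reads the body. A minimal standalone check (standard library only, outside Open WebUI, and assuming the server answers HEAD requests) could look like this:

import urllib.request

# Hypothetical standalone check, not Open WebUI code: ask the server what it is
# about to send before downloading the body.
url = "https://www.debian.org/releases/trixie/release-notes.en.pdf"
req = urllib.request.Request(url, method="HEAD", headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req, timeout=30) as response:
    # Expected to print something like "application/pdf"
    print(response.headers.get("Content-Type"))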

Steps to Reproduce

  1. Install Open WebUI v0.8.12 on Debian 13 via docker-compose
  2. Ask any model (e.g. GLM 5.1) to retrieve a PDF file from a URL (this can also happen unprompted when a model finds a PDF via a search_web tool call)
  3. fetch_url will retrieve the PDF and write the raw binary into the model's context window
  4. Most models crash after this, especially if the PDF is big. Sometimes the model is able to recover and confirms it has just received gibberish.

Example prompts used:

https://www.debian.org/releases/trixie/release-notes.en.pdf
Please summarize this document

I’m trying to debug an issue with the fetch_url tool call. Please try to retrieve a PDF from the internet and see if you can read it
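
The same raw bytes can be reproduced outside Open WebUI with a few lines of standard-library Python (a rough sketch; the exact gibberish depends on the PDF):

import urllib.request

# Hypothetical reproduction outside Open WebUI: fetch the PDF and decode the
# body as text, the way it currently ends up in the model's context window.
url = "https://www.debian.org/releases/trixie/release-notes.en.pdf"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req, timeout=30) as response:
    body = response.read(512)

# Prints the "%PDF-1.5" header followed by mojibake, as shown above.
print(body.decode("utf-8", errors="replace"))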

Logs & Screenshots

Four screenshots attached:

  • https://github.com/user-attachments/assets/d96b1442-ce8f-4cce-9fc6-dd06ff56ea40
  • https://github.com/user-attachments/assets/d81b272c-d47b-4823-9b72-4a11f407fa51
  • https://github.com/user-attachments/assets/2b500238-1c1d-431a-902d-d1c0610238d0
  • https://github.com/user-attachments/assets/7c9dca1a-27ad-4e86-8682-c4a953dbe959

Additional Information

I was able to fix the issue by providing a custom implementation of fetch_url. This was generated using Claude Opus 4.6 and is probably not production ready, but I thought I'd leave it here as an example of how the issue can be solved:

import json
import logging
import asyncio
import io
import socket
import ipaddress
import urllib.request
from urllib.parse import urlparse
from typing import Optional

from open_webui.retrieval.utils import get_content_from_url
from fastapi import Request

log = logging.getLogger(__name__)

# ---------------------------------------------------------------------------
# Security constants
# ---------------------------------------------------------------------------
_ALLOWED_SCHEMES = ("http", "https")

_BLOCKED_NETWORKS = [
    # IPv4 loopback, private, link-local, metadata, and reserved ranges
    ipaddress.ip_network("0.0.0.0/8"),
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("100.64.0.0/10"),
    ipaddress.ip_network("127.0.0.0/8"),
    ipaddress.ip_network("169.254.0.0/16"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.0.0.0/24"),
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("192.88.99.0/24"),
    ipaddress.ip_network("192.168.0.0/16"),
    ipaddress.ip_network("198.18.0.0/15"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("224.0.0.0/4"),
    ipaddress.ip_network("240.0.0.0/4"),
    ipaddress.ip_network("255.255.255.255/32"),
    # IPv6 loopback, private, link-local, and IPv4-mapped ranges
    ipaddress.ip_network("::1/128"),
    ipaddress.ip_network("fc00::/7"),
    ipaddress.ip_network("fe80::/10"),
    ipaddress.ip_network("::ffff:0:0/96"),
]

_MAX_DOWNLOAD_BYTES = 50 * 1024 * 1024  # 50 MB


# ---------------------------------------------------------------------------
# URL validation helper
# ---------------------------------------------------------------------------
def _validate_url(url: str) -> None:
    """Reject URLs that target private/reserved network addresses or use
    disallowed schemes (e.g. file://, ftp://, gopher://)."""
    parsed = urlparse(url)

    # --- Scheme allowlist (blocks file://, ftp://, gopher://, etc.) ---
    if parsed.scheme not in _ALLOWED_SCHEMES:
        raise ValueError(
            f"URL scheme '{parsed.scheme}' is not allowed. Only http and https are permitted."
        )

    hostname = parsed.hostname
    if not hostname:
        raise ValueError("URL does not contain a valid hostname.")

    # --- Resolve and check every address the hostname maps to ---
    try:
        port = parsed.port or (443 if parsed.scheme == "https" else 80)
        addr_infos = socket.getaddrinfo(hostname, port)
    except socket.gaierror:
        raise ValueError(f"Could not resolve hostname: {hostname}")

    for info in addr_infos:
        addr = ipaddress.ip_address(info[4][0])
        for network in _BLOCKED_NETWORKS:
            if addr in network:
                raise ValueError(
                    "Access to private or reserved network addresses is blocked."
                )


# ---------------------------------------------------------------------------
# Safe redirect handler — validates every hop
# ---------------------------------------------------------------------------
class _SafeRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Applies the same URL validation to every 3xx redirect target so that
    an external redirect cannot bounce requests into internal networks."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        _validate_url(newurl)
        return super().redirect_request(req, fp, code, msg, headers, newurl)


# ---------------------------------------------------------------------------
# Tool class
# ---------------------------------------------------------------------------
class Tools:
    def __init__(self):
        pass

    async def fetch_url(
        self,
        url: str,
        __request__: Optional[Request] = None,
        __user__: Optional[dict] = None,
    ) -> str:
        """
        Fetch and extract text content from a web page or a public PDF URL.

        :param url: The URL to fetch content from
        :return: The extracted text content from the page or PDF
        """
        if __request__ is None:
            return json.dumps({"error": "Request context not available"})

        # -----------------------------------------------------------------
        # Inner helper: sniff Content-Type and parse PDFs
        # -----------------------------------------------------------------
        def _detect_and_parse_pdf(target_url: str) -> dict:
            # Validate the user-supplied URL BEFORE any network I/O
            _validate_url(target_url)

            # Build an opener that validates every redirect hop
            opener = urllib.request.build_opener(_SafeRedirectHandler)
            req = urllib.request.Request(
                target_url, headers={"User-Agent": "Mozilla/5.0"}
            )

            with opener.open(req, timeout=30) as response:
                # Re-validate the final URL after redirects
                # (mitigates DNS-rebinding on redirect chains)
                _validate_url(response.url)

                content_type = response.headers.get("Content-Type", "").lower()
                final_url = response.url

                is_pdf = "application/pdf" in content_type or urlparse(
                    final_url
                ).path.lower().endswith(".pdf")

                if is_pdf:
                    from pypdf import PdfReader

                    # Read with a hard size cap to prevent memory exhaustion
                    pdf_bytes = response.read(_MAX_DOWNLOAD_BYTES + 1)
                    if len(pdf_bytes) > _MAX_DOWNLOAD_BYTES:
                        raise ValueError(
                            f"PDF exceeds the maximum allowed size of "
                            f"{_MAX_DOWNLOAD_BYTES // (1024 * 1024)} MB."
                        )

                    reader = PdfReader(io.BytesIO(pdf_bytes))
                    extracted_text = []
                    for page in reader.pages:
                        text = page.extract_text()
                        if text:
                            extracted_text.append(text)
                    return {"is_pdf": True, "content": "\n".join(extracted_text)}
                else:
                    return {"is_pdf": False, "url": final_url}

        # -----------------------------------------------------------------
        # Main flow
        # -----------------------------------------------------------------
        try:
            # 1. Attempt to detect content type and parse PDFs
            try:
                result = await asyncio.to_thread(_detect_and_parse_pdf, url)
            except ValueError:
                # Validation errors must not be swallowed
                raise
            except Exception as e:
                log.warning(
                    f"Initial URL header check failed: {e}. "
                    "Falling back to default handler."
                )
                if urlparse(url).path.lower().endswith(".pdf"):
                    raise
                # Validate before falling back to default handler
                _validate_url(url)
                result = {"is_pdf": False, "url": url}

            # 2. Handle result
            if result.get("is_pdf"):
                content = result["content"]
            else:
                final_html_url = result.get("url", url)
                # Validate the final URL one more time before handing off
                _validate_url(final_html_url)
                content, _ = await asyncio.to_thread(
                    get_content_from_url, __request__, final_html_url
                )

            # 3. Truncate if configured
            max_length = getattr(
                __request__.app.state.config, "WEB_FETCH_MAX_CONTENT_LENGTH", None
            )
            if max_length and max_length > 0 and len(content) > max_length:
                content = content[:max_length] + "\n\n[Content truncated...]"

            return content

        except ValueError as e:
            # Validation errors — safe to surface to the caller
            log.warning(f"fetch_url validation error: {e}")
            return json.dumps({"error": str(e)})
        except Exception as e:
            # Everything else — log detail internally, return generic message
            log.exception(f"fetch_url error: {e}")
            return json.dumps({"error": "Failed to fetch the requested URL."})
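
The SSRF guard in the snippet can be sanity-checked in isolation before wiring the tool into Open WebUI. A rough sketch of such checks (hypothetical, assuming _validate_url from above is importable in the session; not part of the tool itself):

# Hypothetical sanity checks for _validate_url, run outside Open WebUI.
for bad in (
    "file:///etc/passwd",             # disallowed scheme
    "http://127.0.0.1:8080/admin",    # loopback address
    "http://169.254.169.254/latest",  # link-local / cloud metadata range
):
    try:
        _validate_url(bad)
        print(f"UNEXPECTEDLY ALLOWED: {bad}")
    except ValueError as e:
        print(f"blocked as expected: {bad} ({e})")

# A public https URL should pass (requires working DNS resolution).
_validate_url("https://www.debian.org/releases/trixie/release-notes.en.pdf")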
GiteaMirror added the bug label 2026-04-25 09:47:13 -05:00

@tjbck commented on GitHub (Apr 21, 2026):

Addressed in dev.

Reference: github-starred/open-webui#35614