[GH-ISSUE #7002] HTTP Proxy Ignored for Web Scraping in Open WebUI (SearXNG Integration) #30092
Originally created by @ips972 on GitHub (Nov 18, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/7002
Bug Report: HTTP Proxy Ignored for Web Scraping in Open WebUI (SearXNG Integration)
Description
When using SearXNG with Open WebUI, the initial connection to SearXNG successfully retrieves JSON search results. However, the subsequent web scraping process that fetches content from URLs within the results does not respect the configured HTTP proxy settings. This causes failures in environments where direct internet access is blocked, and a proxy is mandatory for outbound connections.
Despite attempting to configure HTTP proxy settings in various ways (e.g., environment variables, Docker options, and application-level configurations), the scraper fails to use the proxy. Upon inspecting the relevant Python file responsible for downloading URLs, it appears that proxy support is either not implemented or incorrectly handled.
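Some HTTP clients (such as requests) read http_proxy/https_proxy from the environment by default, while code that builds its own connections does not. As a minimal stdlib sketch of what "respecting the proxy" means here (the helper name and proxy URL are illustrative, not Open WebUI's actual code):

```python
import os
import urllib.request

def build_proxy_opener():
    # Prefer https_proxy, fall back to http_proxy, as most tooling does.
    proxy = os.environ.get("https_proxy") or os.environ.get("http_proxy")
    if proxy:
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    else:
        # Empty mapping: connect directly, without consulting the environment.
        handler = urllib.request.ProxyHandler({})
    return urllib.request.build_opener(handler)
```

Any fetch performed through such an opener is routed via the configured proxy, which is the behavior the scraper currently lacks.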
Steps to Reproduce
Deploy Open WebUI with SearXNG integration enabled.
Both components use http_proxy correctly: Open WebUI is able to download models from Hugging Face, and SearXNG is able to search the web.
SearXNG's proxy settings are given in its YAML configuration file and work fine; both API calls and manual browser use work as expected.
JSON response is enabled in the SearXNG settings; this also works fine.
Here is a partial tcpdump capture of the response from SearXNG to Open WebUI (question asked: "what's new in AI?"):
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 27329
Server-Timing: total;dur=1622.189, render;dur=0, total_0_google;dur=697.018, total_1_wikipedia;dur=763.253, total_2_brave;dur=912.054, total_3_duckduckgo;dur=1026.366, total_4_qwant;dur=1347.023, total_5_wikidata;dur=1599.436, load_0_google;dur=643.497, load_1_wikipedia;dur=747.168, load_2_brave;dur=837.748, load_3_duckduckgo;dur=1006.767, load_4_qwant;dur=1300.943, load_5_wikidata;dur=1591.216
X-Content-Type-Options: nosniff
X-Download-Options: noopen
X-Robots-Tag: noindex, nofollow
Referrer-Policy: no-referrer
Connection: close
{"query": "latest advancements in artificial intelligence 2024", "number_of_results": 0, "results": [{"url": "https://www.technologyreview.com/2024/01/04/1086046/whats-next-for-ai-in-2024/", "title": "What's next for AI in 2024", "content": "January 8, 2024 - In 2024, generative AI might actually become useful for the regular, non-tech person, and we are going to see more people tinkering with a million little AI models. State-of-the-art AI models, such as GPT-4 and Gemini, are multimodal, meaning they can process not only text but images and even ...", "thumbnail": "", "engine": "google", "parsed_url": ["https", "www.technologyreview.com", "/2024/01/04/1086046/whats-next-for-ai-in-2024/", "", "", ""], "template": "default.html", "engines": ["qwant", "google", "brave"], "positions": [7, 2, 1], "publishedDate": "2024-01-08T00:00:00", "score": 4.928571428571429, "category": "general"}, {"url": "https://www.techtarget.com/searchenterpriseai/tip/9-top-AI-and-machine-learning-trends", "title": "10 top AI and machine learning...........
Configure HTTP proxy settings:
Add http_proxy and https_proxy environment variables in the system.
Pass the proxy configuration via Docker:
docker run -d -p 3000:8080 --gpus all \
  -e http_proxy=http://proxy.example.com:3128 \
  -e https_proxy=http://proxy.example.com:3128 \
  --name open-webui \
  ghcr.io/open-webui/open-webui:cuda
Include proxy settings in the Open WebUI configuration (if applicable).
Perform a search in Open WebUI using SearXNG.
Observe that:
SearXNG returns the JSON response with search results.
The subsequent web scraping fails to retrieve content from the URLs due to lack of proxy usage.
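As a quick diagnostic when reproducing, it can help to confirm that the proxy variables are actually visible to Python inside the container (the proxy URL is illustrative):

```python
import os
from urllib.request import getproxies

# Simulate the container environment for illustration.
os.environ.setdefault("http_proxy", "http://proxy.example.com:3128")

# getproxies() is what proxy-aware stdlib code consults; it should include
# an "http" entry pointing at the configured proxy.
print(getproxies())
```

If this mapping is populated but scraping still bypasses the proxy, the problem is in the fetching code rather than the environment.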
Expected Behavior
The scraper should respect HTTP proxy settings and use the specified proxy for all outbound connections, including fetching content from the URLs returned in the SearXNG JSON results.
Actual Behavior
The scraper attempts direct internet connections, bypassing the configured HTTP proxy. As a result, web scraping fails in environments with restricted direct internet access, since no DNS resolution is available on the network and only browsing through the HTTP proxy is allowed.
Additional Information
Configuration Attempts:
Tried setting http_proxy and https_proxy as environment variables.
Passed proxy settings during Docker container creation.
Manually modified the scraper's Python code to hardcode proxy settings.
Code Inspection:
The Python file responsible for downloading URLs appears not to include proxy support.
Observed the absence of proxy-related options in the code paths leading to URL fetching.
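A hedged sketch of a proxy-aware fetch that the URL-download path could use, assuming the requests library is available (function names are illustrative, not the project's actual code):

```python
import os
import requests

def env_proxies() -> dict:
    # Collect proxies from the conventional environment variables.
    proxies = {}
    if os.environ.get("http_proxy"):
        proxies["http"] = os.environ["http_proxy"]
    if os.environ.get("https_proxy"):
        proxies["https"] = os.environ["https_proxy"]
    return proxies

def fetch(url: str, timeout: float = 10.0) -> str:
    # Passing proxies explicitly guards against code paths that disable
    # requests' own environment lookup (trust_env=False).
    resp = requests.get(url, proxies=env_proxies(), timeout=timeout)
    resp.raise_for_status()
    return resp.text
```

Note that async clients such as aiohttp ignore these environment variables unless the session is created with trust_env=True, which is a common cause of exactly this symptom.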
Environment Details:
Open WebUI version: v0.3.35 (main or cuda - same results)
SearXNG integration: Enabled
Deployment method: Docker
Proxy type: HTTP
Proposed Solution
Implement HTTP proxy support in the web scraping module.
Ensure the proxy settings are inherited from environment variables or explicitly passed through configurations.
Provide documentation on how to configure proxies for the scraping service.
Impact
This bug limits the functionality of Open WebUI in secure environments, preventing users from leveraging SearXNG effectively. Addressing this issue is critical for use cases where proxies are a standard requirement.
debug logs:
ERROR [open_webui.apps.retrieval.main] [Errno -3] Temporary failure in name resolution
Traceback (most recent call last):
File "/app/backend/open_webui/apps/retrieval/main.py", line 1165, in process_web_search
loader = get_web_loader(urls)
^^^^^^^^^^^^^^^^^^^^
File "/app/backend/open_webui/apps/retrieval/web/utils.py", line 90, in get_web_loader
if not validate_url(url):
^^^^^^^^^^^^^^^^^
File "/app/backend/open_webui/apps/retrieval/web/utils.py", line 41, in validate_url
return all(validate_url(u) for u in url)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/backend/open_webui/apps/retrieval/web/utils.py", line 41, in <genexpr>
return all(validate_url(u) for u in url)
^^^^^^^^^^^^^^^
File "/app/backend/open_webui/apps/retrieval/web/utils.py", line 30, in validate_url
ipv4_addresses, ipv6_addresses = resolve_hostname(parsed_url.hostname)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/backend/open_webui/apps/retrieval/web/utils.py", line 48, in resolve_hostname
addr_info = socket.getaddrinfo(hostname, None)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/socket.py", line 974, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -3] Temporary failure in name resolution
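The traceback shows the failure occurs in validate_url(), which resolves each hostname locally before any fetch is attempted; on a proxy-only network that lookup can never succeed, even if the download itself went through the proxy. A hedged sketch of a validation path that defers DNS to the proxy (helper names mirror the traceback, but the code is illustrative):

```python
import os
from urllib.parse import urlparse

def proxy_configured() -> bool:
    return bool(os.environ.get("http_proxy") or os.environ.get("https_proxy"))

def validate_url(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    if proxy_configured():
        # Local DNS is unavailable on proxy-only networks; name resolution
        # happens on the proxy, so skip the local getaddrinfo() check.
        return True
    # ... the existing resolve_hostname() / private-address checks would
    # remain here for direct-connection deployments ...
    return True
```

This would avoid the socket.gaierror above; the SSRF protection that the hostname resolution provides would then need to be enforced differently (for example, on the proxy itself).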