feat: Option not to read specific webpage content after performing a web search #4362

Closed
opened 2025-11-11 15:52:16 -06:00 by GiteaMirror · 0 comments

Originally created by @williamgateszhao on GitHub (Mar 10, 2025).

Check Existing Issues

  • I have searched the existing issues and discussions.

Problem Description

When process_web_search in open-webui/backend/open_webui/routers/retrieval.py runs, it calls search_web to retrieve web_results, each of which contains a snippet. For some search engines, the snippet is not just a brief summary but the complete, or nearly complete, processed content of the webpage. However, these snippets are simply discarded.

process_web_search then uses the default web_loader to fetch each webpage again and obtain its content as docs. In the scenario above, this wastes time and resources, and the scraped result is not necessarily better than the content provided by specialized search providers such as Jina.

For example, the snippet generated by jina_search.py contains the complete webpage content, processed by Jina into markdown.

Another example is tavily.py: adding "include_raw_content": true to the POST data makes Tavily include its processed raw_content in the response.
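To illustrate the Tavily case, here is a minimal sketch that builds the request body and maps a response into snippet-bearing results. The helper names build_tavily_payload and extract_results are hypothetical, and the response field names (results, url, title, content, raw_content) are assumptions based on Tavily's public search API rather than on the current tavily.py code. No network call is made.

```python
from typing import Any


def build_tavily_payload(query: str, count: int) -> dict[str, Any]:
    """Build the JSON body for a Tavily search request (hypothetical helper)."""
    return {
        "query": query,
        "max_results": count,
        # Ask Tavily to return the processed page text with each result.
        "include_raw_content": True,
    }


def extract_results(response_json: dict[str, Any]) -> list[dict[str, str]]:
    """Map a Tavily-style response to link/title/snippet dicts.

    Prefers raw_content (the full processed page) and falls back to the
    short content summary when raw_content is missing or None.
    """
    results = []
    for item in response_json.get("results", []):
        text = item.get("raw_content") or item.get("content") or ""
        results.append(
            {
                "link": item.get("url", ""),
                "title": item.get("title", ""),
                "snippet": text,
            }
        )
    return results
```

With this mapping, the snippet already carries the page text whenever Tavily supplies it, so a second fetch of the same URL would be redundant.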

Desired Solution you'd like

In this situation, there is no need to call get_web_loader to visit each URL individually. Instead, the snippet in web_results can be used directly as docs in the return value of process_web_search.

I suggest adding an option that lets users decide this. This might also address the requirement mentioned in #11488.
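A rough sketch of how such an option could behave is shown below. Everything here is illustrative: the bypass_web_loader flag, the results_to_docs helper, and the SearchResult and Doc stand-in classes are hypothetical names, not the project's actual types or settings, and load_url stands in for whatever the real web loader does.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class SearchResult:
    """Stand-in for a search result carrying a provider-supplied snippet."""
    link: str
    title: Optional[str] = None
    snippet: Optional[str] = None


@dataclass
class Doc:
    """Stand-in for a loaded document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def results_to_docs(
    results: list[SearchResult],
    bypass_web_loader: bool,
    load_url: Callable[[str], Doc],
) -> list[Doc]:
    """Turn search results into docs, optionally skipping the web loader.

    When bypass_web_loader is True and a result has a non-empty snippet,
    the snippet is used as the document text. Otherwise the URL is
    fetched through load_url, as happens today.
    """
    docs = []
    for result in results:
        if bypass_web_loader and result.snippet:
            docs.append(
                Doc(
                    page_content=result.snippet,
                    metadata={"source": result.link, "title": result.title or ""},
                )
            )
        else:
            docs.append(load_url(result.link))
    return docs
```

Falling back to the loader when a snippet is empty keeps the option safe to enable for providers that only sometimes return full content.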

Alternatives Considered

No response

Additional Context

No response

Reference: github-starred/open-webui#4362