[GH-ISSUE #2617] enh: playwright/selenium web search support for RAG #28475

Closed
opened 2026-04-25 03:05:28 -05:00 by GiteaMirror · 9 comments
Owner

Originally created by @tjbck on GitHub (May 28, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/2617

@noperator commented on GitHub (Jan 5, 2025):

Is the idea here that, instead of just using langchain_community.document_loaders.WebBaseLoader._scrape() (below), we also optionally use something like AsyncChromiumLoader (https://python.langchain.com/docs/integrations/document_loaders/async_chromium/)?

https://github.com/open-webui/open-webui/blob/4bc9904b3cd0726d3f9c3cbaeade972cf167b6c4/backend/open_webui/retrieval/web/utils.py#L57-L64

In addition to loading results dynamically, I'm also interested in adding support for:

- loading results from archive sites: https://web.archive.org/, https://archive.ph/
- extracting the "main text of a site" (e.g., with Trafilatura: https://github.com/adbar/trafilatura#:~:text=Robust%20and%20configurable%20extraction%20of%20key%20elements%3A)
- using multiple queries (https://github.com/open-webui/open-webui/issues/7876)
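
For the archive-site item in the list above, here is a minimal sketch of how Wayback Machine lookup URLs might be built. The helper names are hypothetical and not part of Open WebUI. The sketch only constructs URLs, performs no network I/O, and leaves archive.ph out.

```python
from typing import Optional
from urllib.parse import urlencode


def wayback_snapshot_url(url: str, timestamp: Optional[str] = None) -> str:
    """Build a Wayback Machine URL for `url` (hypothetical helper).

    `timestamp` is a YYYYMMDD[hhmmss] prefix; the archive redirects to the
    capture closest to it. Without a timestamp, the URL is requested as-is.
    """
    if timestamp:
        return f"https://web.archive.org/web/{timestamp}/{url}"
    return f"https://web.archive.org/web/{url}"


def wayback_availability_url(url: str, timestamp: Optional[str] = None) -> str:
    """Build a query URL for the Wayback availability API (hypothetical helper)."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    # urlencode percent-escapes the target URL so it survives as one parameter.
    return "https://archive.org/wayback/available?" + urlencode(params)
```

A loader could try the live URL first and fall back to the snapshot URL only when the live fetch fails, which keeps the archive out of the common path.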

@roryeckel commented on GitHub (Jan 28, 2025):

I am testing some code changes here. The first roadblock with AsyncChromiumLoader is that its internal use of asyncio.run causes "ERROR: asyncio.run() cannot be called from a running event loop".

I think PlaywrightURLLoader (https://api.python.langchain.com/en/latest/community/document_loaders/langchain_community.document_loaders.url_playwright.PlaywrightURLLoader.html) may be a better alternative because it has both sync and async implementations specific to Playwright.
However, when I tried it, I got "playwright._impl._errors.Error: It looks like you are using Playwright Sync API inside the asyncio loop. Please use the Async API instead."

Perhaps process_web_search in retrieval.py needs to be refactored to async. However, when I tried this, the event emitter events no longer arrived in the correct order. Oof, can't catch a break, but I'll keep trying.

Edit: I managed to get a draft working and am still cleaning it up.
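
The asyncio.run error quoted above can be reproduced with the standard library alone. The sketch below is an illustration under stated assumptions, not the code from the branch: fetch_page and sync_loader are hypothetical stand-ins for a browser fetch and for a loader that calls asyncio.run internally. It shows two common ways around the error: awaiting the coroutine directly, or pushing the synchronous call into a worker thread that has no running loop.

```python
import asyncio


async def fetch_page(url: str) -> str:
    """Hypothetical stand-in for an async browser fetch (no network I/O)."""
    await asyncio.sleep(0)
    return f"<html>{url}</html>"


def sync_loader(url: str) -> str:
    """Mimics a loader that calls asyncio.run() internally."""
    coro = fetch_page(url)
    try:
        return asyncio.run(coro)
    except RuntimeError:
        coro.close()  # avoid a "coroutine was never awaited" warning
        raise


async def handler_broken(url: str) -> str:
    # A loop is already running here, so the nested asyncio.run() raises
    # RuntimeError.
    return sync_loader(url)


async def handler_await(url: str) -> str:
    # Option 1: stay async end to end and await the coroutine directly.
    return await fetch_page(url)


async def handler_thread(url: str) -> str:
    # Option 2: run the sync loader in a worker thread, where no event
    # loop is running, so its internal asyncio.run() is allowed.
    return await asyncio.to_thread(sync_loader, url)
```

Option 1 matches the suggestion to refactor the caller to async. Option 2 leaves the caller's structure alone at the cost of one thread per call.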

@roryeckel commented on GitHub (Jan 29, 2025):

Please see my progress; I'm open to feedback. I will create a PR soon.

It is functioning great for me here. With my added URL validation, it seems to be more stable even in the old mode: https://github.com/roryeckel/open-webui/commit/4e8b3906821a9a10f4fd0038373291dff41b65cf

TODO for myself: update the documentation for the environment variable "RAG_WEB_LOADER", fix the internationalization I forgot about, and test Docker.
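
The thread does not say what the added URL validation checks. As a rough sketch only, assuming the goal is to drop search results that a web loader cannot fetch, a filter might look like this. Both function names are hypothetical, and this is not the code from the linked commit.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def is_fetchable_url(url: str) -> bool:
    """Return True for http(s) URLs that include a hostname (hypothetical check)."""
    try:
        parsed = urlparse(url.strip())
    except ValueError:
        # urlparse raises ValueError on malformed input such as "http://[::1".
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.hostname)


def filter_urls(urls: Iterable[str]) -> List[str]:
    """Keep only the URLs that pass is_fetchable_url, preserving order."""
    return [u for u in urls if is_fetchable_url(u)]
```

Rejecting bad entries before loading means one malformed search result does not abort the whole batch.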

@roryeckel commented on GitHub (Jan 31, 2025):

Fixed the internationalization and cleaned up a bit.
However, I tested Docker and it does not work: 403s appear to be thrown by Playwright. I will investigate that.

https://github.com/roryeckel/open-webui

Update: it seems the 403 error is actually coming from download_nltk_packages. However, the startup script has already been adjusted to install the nltk package if the environment variable is set.
The error is caused by this access denial: https://utic-public-cf.s3.amazonaws.com/nltk_data_3.8.2.tar.gz
Our unstructured package is far too out of date; they revamped the downloading system to use a different method and URL.

Also, I'd like to move the Chromium dependency installation into the data cache folder.

@roryeckel commented on GitHub (Feb 1, 2025):

I've written documentation for my feature: https://github.com/roryeckel/open-webui-docs/commit/452c447edc1fd25d90b9f06e6d40108527b8c053

@roryeckel commented on GitHub (Feb 2, 2025):

After updating the unstructured package, the Playwright + unstructured web loader is functioning in Docker.
I tested using the official Docker Compose setup with the environment variable "RAG_WEB_LOADER=playwright" set on my branch:

https://github.com/roryeckel/open-webui
https://github.com/roryeckel/open-webui-docs

Could someone please share the next steps for getting this reviewed? Should I create a PR? Thanks in advance!

@roryeckel commented on GitHub (Feb 2, 2025):

After discussion in Discord, we want to allow a secondary Docker container to serve the Playwright API for increased security and bandwidth efficiency. I will make these changes to enable both modes ASAP.

docker run --rm -it -p 3000:3000 mcr.microsoft.com/playwright/python:v1.27.1 playwright run-server --port=3000
Listening on ws://127.0.0.1:3000

@roryeckel commented on GitHub (Feb 3, 2025):

I successfully implemented a separate Playwright container, along with the ability to switch to a mode that installs the dependencies directly inside the Open WebUI container. I added docker-compose.playwright.yaml and updated run-compose.sh. I also updated the docs.
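
The two modes described above (a browser installed inside the Open WebUI container, or a separate Playwright server container) could be selected from the environment. This is a minimal sketch under stated assumptions: RAG_WEB_LOADER comes from this thread, while the variable name PLAYWRIGHT_WS_URI, the "default" label for the old loader, and the LoaderConfig type are placeholders invented for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LoaderConfig:
    engine: str                 # "playwright", or "default" (placeholder name)
    ws_endpoint: Optional[str]  # set only when a remote Playwright server is used


def resolve_loader_config(env: Mapping[str, str] = os.environ) -> LoaderConfig:
    """Pick the web loader mode from environment variables."""
    engine = env.get("RAG_WEB_LOADER", "").strip().lower()
    if engine != "playwright":
        return LoaderConfig(engine="default", ws_endpoint=None)

    # PLAYWRIGHT_WS_URI is an assumed name for the remote server address.
    ws = env.get("PLAYWRIGHT_WS_URI", "").strip() or None
    if ws is not None and not ws.startswith(("ws://", "wss://")):
        raise ValueError(f"expected a ws:// or wss:// endpoint, got {ws!r}")

    # ws is None: launch a local browser; otherwise connect to the server.
    return LoaderConfig(engine="playwright", ws_endpoint=ws)
```

With an endpoint such as ws://playwright:3000 set, the loader would connect to the secondary container. Without one, it would launch a browser installed locally.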

@roryeckel commented on GitHub (Feb 3, 2025):

I've opened a PR here: https://github.com/open-webui/open-webui/pull/9313
Thanks!

Reference: github-starred/open-webui#28475