mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-06 10:58:17 -05:00
[GH-ISSUE #2617] enh: playwright/selenium web search support for RAG #12947
Originally created by @tjbck on GitHub (May 28, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/2617
@noperator commented on GitHub (Jan 5, 2025):
Is the idea here that, instead of just using langchain_community.document_loaders.WebBaseLoader._scrape() (below), we also optionally use something like AsyncChromiumLoader?
4bc9904b3c/backend/open_webui/retrieval/web/utils.py (L57-L64)
In addition to loading results dynamically, I'm also interested in adding support for:
@roryeckel commented on GitHub (Jan 28, 2025):
I am testing some code changes here. The first roadblock with AsyncChromiumLoader is that its internal use of asyncio.run causes "ERROR: asyncio.run() cannot be called from a running event loop".
I think PlaywrightURLLoader may be a better alternative because it has both sync and async implementations specific to Playwright.
However when I tried it, I also got "playwright._impl._errors.Error: It looks like you are using Playwright Sync API inside the asyncio loop. Please use the Async API instead."
Perhaps process_web_search in retrieval.py needs to be refactored to async. However, when I tried this, the event emitter events no longer arrived in the proper order. Oof, can't catch a break, but I'll keep trying.
Edit: managed to get a draft working & still cleaning up
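For anyone following along, here is a minimal sketch of the event-loop problem described above, using only the standard library. The names fetch_page, load_sync, and the handler_* functions are hypothetical stand-ins, not Open WebUI or LangChain code. It shows a sync wrapper that calls asyncio.run() failing inside a running loop, plus two common ways around it: awaiting the coroutine directly, or pushing the sync wrapper onto a worker thread.

```python
import asyncio


async def fetch_page(url: str) -> str:
    # Hypothetical stand-in for a real async page load.
    await asyncio.sleep(0)
    return f"<html>{url}</html>"


def load_sync(url: str) -> str:
    # Mimics a loader that calls asyncio.run() internally.
    coro = fetch_page(url)
    try:
        return asyncio.run(coro)
    except RuntimeError:
        # Close the unstarted coroutine so Python does not warn about it.
        coro.close()
        raise


async def handler_broken(url: str) -> str:
    # Already inside a running loop, so asyncio.run() raises RuntimeError.
    return load_sync(url)


async def handler_awaiting(url: str) -> str:
    # Option 1: stay async all the way down and await the coroutine.
    return await fetch_page(url)


async def handler_threaded(url: str) -> str:
    # Option 2: run the sync wrapper in a worker thread, which has no
    # running loop, so its internal asyncio.run() is allowed.
    return await asyncio.to_thread(load_sync, url)


if __name__ == "__main__":
    print(asyncio.run(handler_awaiting("https://example.com")))
    print(asyncio.run(handler_threaded("https://example.com")))
    try:
        asyncio.run(handler_broken("https://example.com"))
    except RuntimeError as exc:
        print("broken handler failed:", exc)
```

Option 1 usually means making the calling code async as well, which is the refactor discussed above; option 2 keeps the caller unchanged at the cost of a thread per call.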
@roryeckel commented on GitHub (Jan 29, 2025):
Please see my progress, I'm open to feedback. Will create PR soon
It is functioning great for me here. With my added URL validation, it seems to be more stable even in the old mode as well.
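The exact checks on the branch are not reproduced in this thread, so the following is only a sketch of what URL validation before loading could look like, under my own assumptions: the function name validate_url is hypothetical, only http/https is accepted, and literal IP addresses that are not globally routable are rejected. Hostnames are not resolved here, so this does not cover DNS-based cases.

```python
import ipaddress
from urllib.parse import urlparse


def validate_url(url: str) -> bool:
    """Return True if the URL looks safe to hand to a web loader (sketch)."""
    try:
        parsed = urlparse(url)
        host = parsed.hostname
    except ValueError:
        # urlparse can raise on malformed input such as a broken IPv6 literal.
        return False
    if parsed.scheme not in ("http", "https"):
        return False
    if not host:
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so it is a hostname; DNS checks are out of scope.
        return True
    # Reject loopback, private, link-local, and other non-global addresses.
    return ip.is_global


if __name__ == "__main__":
    for candidate in ("https://example.com/page", "ftp://example.com", "http://127.0.0.1:8080/"):
        print(candidate, validate_url(candidate))
```

Filtering the result list with a check like this before it reaches the loader means one bad URL does not abort the whole batch.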
4e8b390682
TODO for myself: update documentation for environment variable "RAG_WEB_LOADER", fix internationalization I forgot about, and test docker
@roryeckel commented on GitHub (Jan 31, 2025):
Fixed the internationalization and cleaned up a bit.
However, I tested Docker and it does not work: 403s appear to be thrown by Playwright. I will investigate.
https://github.com/roryeckel/open-webui
Update: it seems like the 403 error is actually coming from download_nltk_packages. However, the startup script has already been adjusted to install the nltk package if the environment variable is set.
Error is caused by this access denial: https://utic-public-cf.s3.amazonaws.com/nltk_data_3.8.2.tar.gz
Our unstructured package is far too out of date; upstream has revamped the download system to use a different method and URL.
Also, I'd like to move the chromium dep installation into the data cache folder
@roryeckel commented on GitHub (Feb 1, 2025):
I've written documentation for my feature:
452c447edc
@roryeckel commented on GitHub (Feb 2, 2025):
After updating the unstructured package, Playwright + unstructured web loader is functioning in docker.
I tested by using the official docker compose with environment "RAG_WEB_LOADER=playwright" set on my branch:
https://github.com/roryeckel/open-webui
https://github.com/roryeckel/open-webui-docs
Could someone please share the next steps for getting this reviewed? Should I create a PR? Thanks in advance!
@roryeckel commented on GitHub (Feb 2, 2025):
After discussion in discord, we want to allow a secondary docker container to serve the playwright API for increased security and bandwidth efficiency. I will make these changes to enable both modes ASAP
docker run --rm -it -p 3000:3000 mcr.microsoft.com/playwright/python:v1.27.1 playwright run-server --port=3000
Listening on ws://127.0.0.1:3000
@roryeckel commented on GitHub (Feb 3, 2025):
I was successful in implementing a separate playwright container, as well as being able to switch to a different mode to install the dependencies directly inside the Open WebUI container. Added docker-compose.playwright.yaml and updated run-compose.sh. Also updated docs
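As a rough illustration of the two modes (not the code in the branch), the sketch below picks between a remote Playwright server and a locally launched Chromium. The environment variable name PLAYWRIGHT_WS_URI and both function names are assumptions made for this example. The connect() and launch() calls are the ones from Playwright's Python async API, and the playwright argument is expected to be the object yielded by async_playwright().

```python
import os
from typing import Mapping, Optional, Tuple


def choose_browser_mode(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    # PLAYWRIGHT_WS_URI is a hypothetical variable name for this sketch.
    env = os.environ if env is None else env
    ws_uri = env.get("PLAYWRIGHT_WS_URI", "").strip()
    if ws_uri:
        return ("remote", ws_uri)
    return ("local", None)


async def open_browser(playwright, env: Optional[Mapping[str, str]] = None):
    # `playwright` is the object yielded by `async with async_playwright()`.
    mode, ws_uri = choose_browser_mode(env)
    if mode == "remote":
        # Attach to a server started with `playwright run-server`.
        return await playwright.chromium.connect(ws_uri)
    # Otherwise launch a Chromium installed inside this container.
    return await playwright.chromium.launch(headless=True)
```

With the server from the earlier docker run command, the remote mode would be selected by setting the variable to ws://127.0.0.1:3000; leaving it unset falls back to the in-container browser.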
@roryeckel commented on GitHub (Feb 3, 2025):
I've opened a PR here: https://github.com/open-webui/open-webui/pull/9313
Thanks!