[PR #9313] [CLOSED] feat: Support Playwright RAG Web Loader #9151

Closed
opened 2025-11-11 18:15:25 -06:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/9313
Author: @roryeckel
Created: 2/4/2025
Status: Closed

Base: dev ← Head: playwright


📝 Commits (10+)

📊 Changes

11 files changed (+260 additions, -39 deletions)

📝 backend/open_webui/config.py (+11 -0)
📝 backend/open_webui/main.py (+4 -0)
📝 backend/open_webui/retrieval/web/utils.py (+188 -19)
📝 backend/open_webui/routers/retrieval.py (+3 -2)
📝 backend/open_webui/utils/middleware.py (+9 -15)
📝 backend/requirements.txt (+2 -2)
📝 backend/start.sh (+11 -0)
📝 backend/start_windows.bat (+11 -0)
➕ docker-compose.playwright.yaml (+10 -0)
📝 pyproject.toml (+2 -1)
📝 run-compose.sh (+9 -0)

📄 Description

Pull Request Checklist

Note to first-time contributors: Please open a discussion post in Discussions (https://github.com/open-webui/open-webui/discussions) and describe your changes before submitting a pull request.

Before submitting, make sure you've checked the following:

  • [x] Target branch: Please verify that the pull request targets the dev branch.
  • [x] Description: Provide a concise description of the changes made in this pull request.
  • [x] Changelog: Ensure a changelog entry following the format of Keep a Changelog (https://keepachangelog.com/) is added at the bottom of the PR description.
  • [x] Documentation: Have you updated relevant documentation (Open WebUI Docs, https://github.com/open-webui/docs) or other documentation sources?
  • [x] Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
  • [ ] Testing: Have you written and run sufficient tests for validating the changes?
  • [x] Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • [x] Prefix: To clearly categorize this pull request, prefix the pull request title, using one of the following:
    • BREAKING CHANGE: Significant changes that may affect compatibility
    • build: Changes that affect the build system or external dependencies
    • ci: Changes to our continuous integration processes or workflows
    • chore: Refactor, cleanup, or other non-functional code changes
    • docs: Documentation update or addition
    • feat: Introduces a new feature or enhancement to the codebase
    • fix: Bug fix or error correction
    • i18n: Internationalization or localization changes
    • perf: Performance improvement
    • refactor: Code restructuring for better maintainability, readability, or scalability
    • style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc.)
    • test: Adding missing tests or correcting existing tests
    • WIP: Work in progress, a temporary label for incomplete or ongoing work

Changelog Entry

Description

This PR is a revision of https://github.com/open-webui/open-webui/pull/9263 intended to resolve https://github.com/open-webui/open-webui/issues/2617

  • Traditional web scraping methods such as the safe_web mode can return lower-quality results when page elements are loaded dynamically through JavaScript. By using Playwright, we can let these elements load before the content is retrieved.
  • Introducing Playwright support could also indirectly benefit Tools and Functions by enabling web-browsing agents in the future.
  • I've enabled support for two modes: a Playwright server in a separate container, or optionally installing the Chromium dependencies directly in the Open WebUI container on launch.
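
The two modes described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the names browser_mode and fetch_rendered_html are hypothetical, and it assumes the playwright package is installed (plus a local Chromium build when no server URI is given).

```python
from typing import Optional


def browser_mode(ws_uri: Optional[str]) -> str:
    """Return "remote" when a Playwright server URI is given, else "local"."""
    return "remote" if ws_uri else "local"


async def fetch_rendered_html(url: str, ws_uri: Optional[str] = None) -> str:
    """Load a page in Chromium and return its HTML after scripts have run."""
    # Imported lazily so this module can be imported without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        if browser_mode(ws_uri) == "remote":
            # Mode 1: attach to a Playwright server running in another container.
            browser = await p.chromium.connect(ws_uri)
        else:
            # Mode 2: start the Chromium build installed in this container.
            browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(url)
            return await page.content()
        finally:
            await browser.close()
```

From synchronous code this could be called as asyncio.run(fetch_rendered_html("https://example.com")), passing ws_uri="ws://..." to use a remote server instead of a local browser.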

Added

  • Added environment variables:
    • RAG_WEB_LOADER (can be safe_web or playwright)
    • PLAYWRIGHT_WS_URI (can be None or a WebSocket URI starting with ws://)
  • Added dependency: playwright==1.49.1
  • Introduce ability to switch out WebLoaderClass with the existing SafeWebBaseLoader OR the new SafePlaywrightURLLoader
  • In startup scripts, automatically download chromium for playwright when no PLAYWRIGHT_WS_URI is specified
  • In startup scripts, automatically download nltk tokenizer "punkt_tab" for unstructured inside SafePlaywrightURLLoader when in playwright mode
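
As a rough illustration of how the items above fit together, the sketch below picks a loader class from RAG_WEB_LOADER and works out which startup downloads are needed. It is an assumption-laden sketch rather than this PR's implementation: the two classes are empty placeholders for the real loaders, the WEB_LOADERS registry and both function names are hypothetical, the safe_web default and case-insensitive matching are assumed, and the exact commands run by the startup scripts may differ.

```python
import os
from typing import Dict, List, Mapping, Type


class SafeWebBaseLoader:
    """Placeholder standing in for the existing loader class."""


class SafePlaywrightURLLoader:
    """Placeholder standing in for the new Playwright-backed loader class."""


# Hypothetical registry keyed by the RAG_WEB_LOADER value.
WEB_LOADERS: Dict[str, Type] = {
    "safe_web": SafeWebBaseLoader,
    "playwright": SafePlaywrightURLLoader,
}


def select_web_loader(env: Mapping[str, str] = os.environ) -> Type:
    """Return the loader class named by RAG_WEB_LOADER (assumed default: safe_web)."""
    name = env.get("RAG_WEB_LOADER", "safe_web").lower()
    try:
        return WEB_LOADERS[name]
    except KeyError:
        raise ValueError(f"Unknown RAG_WEB_LOADER: {name!r}") from None


def startup_commands(env: Mapping[str, str] = os.environ) -> List[List[str]]:
    """List the downloads a startup script would run for this environment."""
    commands: List[List[str]] = []
    if env.get("RAG_WEB_LOADER", "safe_web").lower() != "playwright":
        return commands  # safe_web mode needs no extra downloads
    if not env.get("PLAYWRIGHT_WS_URI"):
        # No remote Playwright server configured, so a local Chromium is needed.
        commands.append(["playwright", "install", "chromium"])
    # The tokenizer is needed in playwright mode regardless of where the browser runs.
    commands.append(["python", "-c", "import nltk; nltk.download('punkt_tab')"])
    return commands
```

Passing a plain dict instead of os.environ, for example select_web_loader({"RAG_WEB_LOADER": "playwright"}), makes the selection easy to check without touching the real environment.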

Changed

  • Updated unstructured from 0.15.9 to 0.16.17 (this fixes issues with downloading nltk tokenizer)
  • Made process_web_search async to accommodate playwright
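
To show why the caller has to become a coroutine, here is a simplified sketch using only asyncio. FakeAsyncLoader and process_web_search_sketch are hypothetical stand-ins, not the real loader or the real process_web_search; the point is only that a loader exposing an awaitable method must be awaited by an async caller.

```python
import asyncio
from typing import List


class FakeAsyncLoader:
    """Hypothetical stand-in for a loader with an awaitable load method."""

    def __init__(self, urls: List[str]) -> None:
        self.urls = urls

    async def aload(self) -> List[str]:
        # A real loader would await browser navigation here.
        await asyncio.sleep(0)
        return [f"content of {url}" for url in self.urls]


async def process_web_search_sketch(urls: List[str]) -> List[str]:
    """Await the loader instead of blocking the event loop on it."""
    loader = FakeAsyncLoader(urls)
    return await loader.aload()
```

Inside an async request handler the coroutine would simply be awaited; from synchronous code it can be driven with asyncio.run(process_web_search_sketch([...])).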

Deprecated

  • None

Removed

  • None

Fixed

  • None

Security

  • None

Breaking Changes

  • None

Additional Information

  • I've updated the relevant documentation for my feature here: https://github.com/open-webui/docs/compare/main...roryeckel:open-webui-docs:main but haven't created a PR for the docs yet.
  • Please let me know if I need to do or change anything!

Commits

  • 4e8b390 Add RAG_WEB_LOADER + Playwright mode + improve stability of search (https://github.com/open-webui/open-webui/commit/4e8b3906821a9a10f4fd0038373291dff41b65cf)
  • 8dafe3c Merge branch 'dev' of https://github.com/open-webui/open-webui (https://github.com/open-webui/open-webui/commit/8dafe3cba8c9166e33298f1c640b2bec974c5612)
  • 2452e27 Refine RAG_WEB_LOADER (https://github.com/open-webui/open-webui/commit/2452e271cddccf0c835ae17f4505471eb41a4313)
  • 77ae73e Adjust search event messages + translations (https://github.com/open-webui/open-webui/commit/77ae73e659e6fea6da34c3ea913edb3dc4f037a9)
  • a84e488 Fix playwright in docker by updating unstructured (https://github.com/open-webui/open-webui/commit/a84e488a4ea681c580a2b9cca22fe176f8c0014c)
  • 8da3372 Support PLAYWRIGHT_WS_URI (https://github.com/open-webui/open-webui/commit/8da33721d563754becd0d03bf86605441e0bd9e3)
  • c3df481 Introduce docker-compose.playwright.yaml + run-compose update (https://github.com/open-webui/open-webui/commit/c3df481b22d8bc13a7deb045e94b0bcf4235224e)
  • f837d2c Merge branch 'dev' of https://github.com/open-webui/open-webui (https://github.com/open-webui/open-webui/commit/f837d2cdbb40642b157d936c936cdf8eadc44ef3)
  • 22746c7 Merge remote-tracking branch 'upstream/dev' (https://github.com/open-webui/open-webui/commit/22746c7a3f86f6f49c8eb3ded3bc45d407f68177)
  • 1b581b7 Moving code out of playwright branch (https://github.com/open-webui/open-webui/commit/1b581b714f6749e51bf17c49434976a0c57900c6)

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
GiteaMirror added the pull-request label 2025-11-11 18:15:25 -06:00
Reference: github-starred/open-webui#9151