[PR #17043] [CLOSED] fix: Web search results text cleaning before upsert #24303

Closed
opened 2026-04-20 05:19:58 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/17043
Author: @PVBLIC-F
Created: 8/29/2025
Status: Closed

Base: dev ← Head: fix/clean-vectors


📝 Commits (2)

  • 2407d9b Merge pull request #16859 from open-webui/dev
  • 201acd3 fix: improve web text cleaning to remove excessive whitespace

📊 Changes

1 file changed (+21 additions, -2 deletions)


📝 backend/open_webui/retrieval/web/utils.py (+21 -2)

📄 Description

Pull Request Checklist

Note to first-time contributors: Please open a discussion post in Discussions and describe your changes before submitting a pull request.

Before submitting, make sure you've checked the following:

  • Target branch: Please verify that the pull request targets the dev branch.
  • Description: Provide a concise description of the changes made in this pull request.
  • Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
  • Documentation: Have you updated relevant documentation Open WebUI Docs, or other documentation sources?
  • Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
  • Testing: Have you written and run sufficient tests to validate the changes?
  • Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:
    • BREAKING CHANGE: Significant changes that may affect compatibility
    • build: Changes that affect the build system or external dependencies
    • ci: Changes to our continuous integration processes or workflows
    • chore: Refactor, cleanup, or other non-functional code changes
    • docs: Documentation update or addition
    • feat: Introduces a new feature or enhancement to the codebase
    • fix: Bug fix or error correction
    • i18n: Internationalization or localization changes
    • perf: Performance improvement
    • refactor: Code restructuring for better maintainability, readability, or scalability
    • style: Changes that do not affect the meaning of the code (white space, formatting, missing semi-colons, etc.)
    • test: Adding missing tests or correcting existing tests
    • WIP: Work in progress, a temporary label for incomplete or ongoing work

Changelog Entry

Description

  • Improves web search text cleaning so that excessive whitespace and malformed content no longer pollute the vector database. The fix normalizes text and corrects encoding in web-scraped content before it is embedded into the vector database, which should improve search quality and reduce storage overhead.

Added

  • Text cleaning helper method _clean_text() in SafeWebBaseLoader class
  • ftfy import for robust text encoding fixes in web content processing
  • Whitespace normalization using the strip=True and separator=' ' parameters of BeautifulSoup's get_text()
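
The diff itself is not reproduced on this page, so the following is only a rough, stdlib-only approximation of what strip-and-join text extraction does: each text node is trimmed, empty nodes are dropped, and the rest are joined with a separator. The names TextCollector and extract_text are hypothetical and are not taken from the PR, which uses BeautifulSoup rather than html.parser.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects non-empty, trimmed text nodes from an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Trim each text node and skip nodes that are only whitespace.
        stripped = data.strip()
        if stripped:
            self.parts.append(stripped)


def extract_text(html, separator=" "):
    """Hypothetical helper: join trimmed text nodes with `separator`."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()  # flush any text buffered after the last tag
    return separator.join(collector.parts)


sample = "<div>\n\n  <p> Hello </p>\n\n\n<p>world</p>\n</div>"
print(extract_text(sample))
```

Note that trimming only affects the ends of each text node; runs of whitespace inside a single node survive this step and need separate normalization.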

Changed

  • Modified lazy_load() method in SafeWebBaseLoader to use cleaned text extraction
  • Modified alazy_load() method in SafeWebBaseLoader to use cleaned text extraction
  • Enhanced text processing pipeline to remove excessive newlines (\n\n\n\n) and normalize spacing
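
As an illustration of the newline and spacing cleanup described above, here is a minimal sketch using only the standard library. The function name normalize_whitespace and the exact rules (collapse horizontal whitespace, allow at most one blank line) are assumptions made for this example; the body of the PR's _clean_text() is not shown on this page and may differ.

```python
import re


def normalize_whitespace(text):
    """Hypothetical helper: collapse spacing and cap consecutive newlines."""
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces, tabs, and other horizontal whitespace.
    text = re.sub(r"[ \t\f\v]+", " ", text)
    # Drop single spaces left next to newlines.
    text = re.sub(r" ?\n ?", "\n", text)
    # Reduce three or more newlines to one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


print(repr(normalize_whitespace("Title\n\n\n\nBody   text \n  more")))
```

Keeping one blank line preserves paragraph boundaries, which some chunking strategies rely on; collapsing everything to single spaces is the other common choice.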

Deprecated

  • [List any deprecated functionality or features that have been removed]

Removed

  • [List any removed features, files, or functionalities]

Fixed

  • Web content cleaning: Resolved vector database pollution caused by excessive whitespace, newlines, and malformed text from web scraping
  • Text encoding issues: Added ftfy.fix_text() processing to handle mojibake and Unicode problems in scraped content
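
To make the mojibake case concrete, the sketch below reproduces one common form of it (UTF-8 bytes decoded as Latin-1) and reverses it by hand using only the standard library. The helper name repair_mojibake is hypothetical and this is not how the PR does it: the PR relies on ftfy.fix_text(), which detects and repairs such text heuristically and covers many more cases than this single round trip.

```python
def repair_mojibake(text):
    """Hypothetical helper: undo UTF-8 text that was mis-decoded as Latin-1."""
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # The text was not this kind of mojibake; leave it unchanged.
        return text


# Simulate the damage: UTF-8 bytes read with the wrong codec.
broken = "café".encode("utf-8").decode("latin-1")
print(broken)
print(repair_mojibake(broken))
```

The fallback matters: correctly decoded text such as "naïve" or "日本語" fails the round trip and is returned untouched rather than corrupted.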

Security

  • [List any new or updated security-related changes, including vulnerability fixes]

Breaking Changes

  • BREAKING CHANGE: [List any breaking changes affecting compatibility or functionality]

Additional Information

  • [Insert any additional context, notes, or explanations for the changes]
    • [Reference any related issues, commits, or other relevant information]

Screenshots or Videos

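  • BEFORE: CleanShot 2025-08-29 at 07 45 26 (https://github.com/user-attachments/assets/83107110-61c0-4e97-9655-dca29ad1ace6)
  • AFTER: CleanShot 2025-08-29 at 09 15 34 (https://github.com/user-attachments/assets/b27e949a-d4c9-4f31-a9e7-1e218ea8ee28)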

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-20 05:19:58 -05:00
Reference: github-starred/open-webui#24303