[PR #1292] [MERGED] Add htm/html support for RAG documents #7424

Closed
opened 2025-11-11 17:26:01 -06:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/1292
Author: @ddanat-smm
Created: 3/25/2024
Status: Merged
Merged: 3/26/2024
Merged by: @tjbck

Base: dev ← Head: dev


📝 Commits (5)

  • 784a6ec include html langchain loader for RAG
  • 77f4ffd add htm/html to supported extensions in ui
  • c91a5d8 switch to using BeautifulSoup HTML loader so title is also captured
  • 6307adf feat: better error handling
  • 3688955 fix: encoding issue
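
Commit c91a5d8 mentions switching to a BeautifulSoup-based HTML loader so that the page title is captured along with the text, and 3688955 mentions an encoding fix. The actual diff to backend/apps/rag/main.py is not shown on this page, so the following is only a rough standard-library sketch of the general idea (extract visible text plus the `<title>`, decoding leniently). `TitleAndTextParser` and `load_html` are hypothetical names for illustration, not code from the PR.

```python
from html.parser import HTMLParser


class TitleAndTextParser(HTMLParser):
    """Collects the <title> and the visible text of an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self._chunks = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            # Title text goes into metadata, not into the body text.
            self.title += data
        elif self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    @property
    def text(self):
        return "\n".join(self._chunks)


def load_html(raw: bytes, encoding: str = "utf-8"):
    """Hypothetical helper: decode leniently, return (text, metadata)."""
    parser = TitleAndTextParser()
    # errors="replace" is an assumption here: undecodable bytes become
    # U+FFFD instead of raising, so one bad byte does not fail the upload.
    parser.feed(raw.decode(encoding, errors="replace"))
    parser.close()
    return parser.text, {"title": parser.title.strip()}
```

Keeping the title in a separate metadata dict, rather than mixing it into the text, is one way to let a retrieval pipeline show where a chunk came from; whether the merged code does the same is not confirmed by this page.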

📊 Changes

3 files changed (+64 additions, -43 deletions)


📝 backend/apps/rag/main.py (+59 -43)
📝 backend/constants.py (+2 -0)
📝 src/lib/constants.ts (+3 -0)

📄 Description

hey folks, here's a quick and dirty PR for HTML document support in RAG documents. If there's anything I missed, just let me know :)

Pull Request Checklist

  • [x] Description: Briefly describe the changes in this pull request.
  • [x] Changelog: Ensure a changelog entry following the format of Keep a Changelog (https://keepachangelog.com/) is added at the bottom of the PR description.
  • [ ] Documentation: Have you updated relevant documentation? (extension support not documented at all yet)
  • [x] Dependencies: Are there any new dependencies? (no) Have you updated the dependency versions in the documentation?

Description

Add htm/html support for RAG documents


Changelog Entry

Added

  • Added .htm and .html to supported extensions in UI

Fixed

n/a

Changed

  • Included langchain's beautifulsoup4 html loader (https://python.langchain.com/docs/modules/data_connection/document_loaders/html#loading-html-with-beautifulsoup4) for .html/.htm RAG documents

Removed

n/a


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2025-11-11 17:26:01 -06:00
Reference: github-starred/open-webui#7424