mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-07 03:18:23 -05:00
[PR #15548] [MERGED] fix: text/html files being detected as text when loaded with docling/tika #39497
📋 Pull Request Information
Original PR: https://github.com/open-webui/open-webui/pull/15548
Author: @expruc
Created: 7/6/2025
Status: ✅ Merged
Merged: 7/8/2025
Merged by: @tjbck
Base: dev ← Head: fix/docling_ignore_html
📝 Commits (1)
453a2bd fixed issue where text/html files being detected as text when loaded
📊 Changes
1 file changed (+4 additions, -1 deletions)
📝 backend/open_webui/retrieval/loaders/main.py (+4 -1)
📄 Description
Pull Request Checklist
Note to first-time contributors: Please open a discussion post in Discussions and describe your changes before submitting a pull request.
Before submitting, make sure you've checked the following:
dev branch.
Changelog Entry
Description
Fixed an issue where HTML files were detected as text/html and parsed with the text parser instead of the configured external parser (docling/tika).
Fixed
Additional Information
This issue occurs when inserting HTML files through the UI or API while the content extraction engine is set to docling or tika. The cause is a function that inspects the file's detected content type and classifies HTML files as text, because the detected type is text/html. This PR fixes it by adding a condition so that a file is not classified as text when its content type contains both text and html.
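The condition described above can be sketched roughly as follows. This is a minimal illustration, not the code from `backend/open_webui/retrieval/loaders/main.py`: the names `is_plain_text` and `pick_parser` and the string labels for parsers are hypothetical, and the real loader may take other inputs (such as the file extension) into account.

```python
from typing import Optional


def is_plain_text(content_type: Optional[str]) -> bool:
    """Hypothetical helper: True for text/* content types that are not HTML."""
    if not content_type:
        return False
    # "text/html" contains both "text/" and "html", so it is excluded here
    # and no longer falls through to the plain text parser.
    return "text/" in content_type and "html" not in content_type


def pick_parser(content_type: Optional[str], engine: str) -> str:
    """Hypothetical router: plain text uses the text parser, the rest the engine."""
    return "text" if is_plain_text(content_type) else engine


if __name__ == "__main__":
    for ctype in ("text/plain", "text/html", "text/html; charset=utf-8"):
        print(ctype, "->", pick_parser(ctype, "docling"))
```

With this kind of check, an upload reported as text/html is handed to the configured engine (docling or tika in this sketch), while other text/* types still use the text parser.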
Contributor License Agreement
By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.