mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-06 02:48:13 -05:00
[GH-ISSUE #3474] Support Apache Tika for RAG text extraction #13279
Originally created by @nickovs on GitHub (Jun 27, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/3474
Currently, document attachments for RAG are parsed by selecting from a grab-bag of document loaders from the LangChain community set. While this avoids the need for external services, the set of supported file types is small, the results are not always high quality in terms of output order and spacing, and it doesn't support valuable features such as OCR.
It would be great if Open WebUI optionally allowed use of Apache Tika as an alternative way of parsing attachments.
Tika has mature support for parsing hundreds of different document formats, which would greatly expand the set of documents that could be passed in to Open WebUI. It also has integrated support for applying OCR to embedded images, so for instance text extraction from a PDF that is made up of scans of pages "just works".
Importantly, useful installations of Tika are available as completely self-contained Docker images with a REST interface, including versions with bundled Tesseract OCR, making deployment as part of a docker-compose.yml very easy.
Supporting Tika would involve providing a configuration option to let the admin set a Tika service URL and then fixing up backend/apps/rag/main.py to offload most of the text extraction work to Tika if this variable is set.
@tjbck commented on GitHub (Jun 27, 2024):
I believe the new Filter function should enable this use case. I'd love to collaborate on this if you're interested!
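A minimal sketch of the offload step described in the issue body, assuming a Tika server whose REST interface accepts a document via PUT on its /tika endpoint and returns plain text when asked with an Accept: text/plain header. The function names, the fallback parameter, and the TIKA_SERVER_URL variable name are hypothetical and chosen for illustration only; none of this reflects how Open WebUI actually wires up its loaders.

```python
import os
import urllib.request
from typing import Callable, Optional


def build_tika_request(data: bytes, tika_url: str) -> urllib.request.Request:
    """Build a PUT request that asks a Tika server for plain-text output."""
    return urllib.request.Request(
        url=tika_url.rstrip("/") + "/tika",
        data=data,
        method="PUT",
        headers={
            "Accept": "text/plain",  # ask for extracted text, not XHTML
            "Content-Type": "application/octet-stream",  # let Tika detect the type
        },
    )


def tika_extract_text(data: bytes, tika_url: str, timeout: float = 60.0) -> str:
    """Send raw document bytes to Tika and return the extracted text."""
    req = build_tika_request(data, tika_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_text(
    data: bytes,
    fallback: Callable[[bytes], str],
    tika_url: Optional[str] = None,
) -> str:
    """Use Tika when a service URL is configured, else the existing loader.

    `fallback` stands in for whatever local loader would otherwise run.
    """
    if tika_url:
        return tika_extract_text(data, tika_url)
    return fallback(data)


if __name__ == "__main__":
    # TIKA_SERVER_URL is a hypothetical setting name, e.g. http://tika:9998
    url = os.environ.get("TIKA_SERVER_URL")
    text = extract_text(b"plain bytes", lambda d: d.decode("utf-8"), url)
    print(text)
```

With the variable unset, the sketch falls through to the local loader, so existing deployments would behave as before. Setting it routes every attachment through the Tika service instead.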