[GH-ISSUE #3474] Support Apache Tika for RAG text extraction #13279

Closed
opened 2026-04-19 20:03:53 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @nickovs on GitHub (Jun 27, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/3474

Currently, document attachments for RAG are parsed by selecting from a grab-bag of document loaders from the LangChain community set. While this avoids the need for external services, the supported file type set is small, the results are not always high quality in terms of output order and spacing, and it doesn't support valuable features such as OCR.

It would be great if Open WebUI optionally allowed use of Apache Tika as an alternative way of parsing attachments.

Tika has mature support for parsing hundreds of different document formats, which would greatly expand the set of documents that could be passed in to Open WebUI. It also has integrated support for applying OCR to embedded images, so for instance text extraction from a PDF that is made up of scans of pages "just works".

Importantly, useful installations of Tika are available as completely self-contained Docker images with a REST interface, including versions with bundled Tesseract OCR, making deployment as part of a docker-compose.yml very easy.

Supporting Tika would involve providing a configuration option to let the admin set a Tika service URL and then fixing up backend/apps/rag/main.py to offload most of the text extraction work to Tika if this variable is set.
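
For illustration, here is a minimal sketch of what that offloading could look like. It is not the actual Open WebUI code. The `TIKA_SERVER_URL` variable name, the `extract_text` helper, and the `local_loader` callback are all hypothetical names chosen for this example. The only Tika-specific part is the request itself: Tika Server accepts a `PUT` to its `/tika` endpoint and returns plain text when asked for `text/plain`.

```python
import os
import urllib.request
from typing import Callable, Optional

# Hypothetical setting name; the issue only asks for "a Tika service URL".
TIKA_URL_ENV = "TIKA_SERVER_URL"


def build_tika_request(
    tika_url: str, data: bytes, content_type: Optional[str] = None
) -> urllib.request.Request:
    """Build a PUT request for Tika Server's /tika endpoint, asking for plain text."""
    headers = {"Accept": "text/plain"}
    if content_type:
        # Passing the MIME type is optional; Tika can detect it from the bytes.
        headers["Content-Type"] = content_type
    return urllib.request.Request(
        tika_url.rstrip("/") + "/tika", data=data, headers=headers, method="PUT"
    )


def extract_text(
    data: bytes,
    content_type: Optional[str],
    local_loader: Callable[[bytes], str],
    tika_url: Optional[str] = None,
    timeout: float = 60.0,
) -> str:
    """Use Tika when a URL is configured, otherwise fall back to the local loader."""
    tika_url = tika_url or os.environ.get(TIKA_URL_ENV)
    if not tika_url:
        # No Tika configured: keep the existing in-process behaviour.
        return local_loader(data)
    request = build_tika_request(tika_url, data, content_type)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a Tika container listening on its default port 9998, a call such as `extract_text(pdf_bytes, "application/pdf", local_loader, tika_url="http://localhost:9998")` would send the document to Tika. With no URL set, behaviour stays exactly as it is today, which keeps the feature optional.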

Author
Owner

@tjbck commented on GitHub (Jun 27, 2024):

I believe the new Filter function should enable this use case. I'd love to collaborate on this if you're interested!


Reference: github-starred/open-webui#13279