mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-06 19:08:59 -05:00
[GH-ISSUE #6844] enh: marker integration for better pdf parsing #30035
Originally created by @tjbck on GitHub (Nov 11, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/6844
Originally assigned to: @tjbck on GitHub.
@sir3mat commented on GitHub (Nov 12, 2024):
What about docling support?
@jannikstdl commented on GitHub (Nov 13, 2024):
Seems interesting,
maybe https://github.com/drmingler/docling-api is also worth looking at.
@hongbo-miao commented on GitHub (Dec 28, 2024):
MinerU is quite promising.
I tried both Docling and MinerU. Unfortunately, Docling does not currently support formula recognition. However, the overall experience with Docling is quite smooth.
MinerU provides the best results and can even recognize formulas. Additionally, its table parsing and layout detection are much better.
Below is a markdown generated by MinerU. As you can see, even though this formula is very complex, it recognized it perfectly. 🤩
Under the hood, it uses
However, the setup is complex (it is easy now). I provided a guide at https://github.com/opendatalab/MinerU/discussions/1374 if you want to try MinerU ☺️
@hongbo-miao commented on GitHub (Jan 6, 2025):
There is also a recent one from Microsoft: MarkItDown. However, it currently lacks formula support as well: https://github.com/microsoft/markitdown/issues/17
@flefevre commented on GitHub (Jan 25, 2025):
Perhaps a good solution would be to design a generic interface.
The general settings already have a "Content Extraction Engine" parameter, which is set to "default|tika".
Perhaps we could have a generic interface for plugging in different content extraction servers?
Integrating an enhanced PDF extractor able to handle scientific papers would have a big impact for us, as we have various PhD students and scientists.
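A minimal sketch of what such a pluggable interface could look like, in Python. Everything here is hypothetical and for illustration only: the `ContentExtractor` protocol, the `register` decorator, `get_extractor`, and the `"default"` engine implementation are not taken from the Open WebUI codebase. The idea is simply that each engine registers itself under a name, and the configured engine name selects the implementation.

```python
from typing import Callable, Dict, Protocol


class ContentExtractor(Protocol):
    """Hypothetical interface: turns raw file bytes into plain text."""

    def extract(self, data: bytes, filename: str) -> str: ...


# Registry mapping an engine name (as it might appear in settings)
# to a factory that builds the extractor. Hypothetical, not Open WebUI code.
_EXTRACTORS: Dict[str, Callable[[], ContentExtractor]] = {}


def register(name: str):
    """Class decorator that registers an extractor under an engine name."""

    def decorator(cls):
        _EXTRACTORS[name] = cls
        return cls

    return decorator


@register("default")
class PlainTextExtractor:
    """Stand-in engine for the sketch: decodes the bytes as UTF-8."""

    def extract(self, data: bytes, filename: str) -> str:
        return data.decode("utf-8", errors="replace")


def get_extractor(engine: str) -> ContentExtractor:
    """Look up the extractor for the configured engine name."""
    try:
        return _EXTRACTORS[engine]()
    except KeyError:
        known = ", ".join(sorted(_EXTRACTORS))
        raise ValueError(f"unknown engine {engine!r}; known engines: {known}")


if __name__ == "__main__":
    extractor = get_extractor("default")
    print(extractor.extract(b"hello", "note.txt"))
```

With a shape like this, an engine backed by an external extraction server would only need to implement `extract` and register itself under its own name.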
@flefevre commented on GitHub (Jan 30, 2025):
Some independent people from our laboratory ran some tests, basically comparing openwebui/mistralsmall with adobe/reader RAG.
They produced a negative report on Open WebUI, trying to convince people not to use it.
This was mainly because they did not know the architecture of Open WebUI and missed the fact that RAG in Open WebUI depends on the first module, "tika", as the OCR extraction tool.
To keep people using Open WebUI for scientific topics, we need some advice on how to configure Open WebUI to get good results.
Thanks for sharing your expertise and vision
@MichaelKarpe commented on GitHub (Feb 2, 2025):
For local experimental purposes, it takes minimal changes to use docling as a content extraction engine, see #9238. Unfortunately, I will not have time in the coming days or weeks to make it ready for a release. If you have a few hours or more, please feel free to build upon this PR for docling integration.
@oatmealm commented on GitHub (May 30, 2025):
@tjbck will it be possible to allow custom URLs for those who want to try the self-hosted version? It seems that the address of the provider's endpoint is hard-coded.
@Mte90 commented on GitHub (Jun 27, 2025):
I built this one as an external tool: https://github.com/CodeAtCode/deadsimple. It can probably be improved to use other tools instead of markitdown.
@vojtapolasek commented on GitHub (Jul 9, 2025):
It seems that there is an option to use the hosted Marker API. However, I would like to be able to change the URL so that I can use a self-hosted Marker API. Please consider this.
@HenkieTenkie62 commented on GitHub (Jul 16, 2025):
This would be a great option, and I think not a lot of code would be needed to change the address for API calls. It would be a great addition, as I've found the standard text extraction, and even Docling with EasyOCR and RapidOCR, to be very underwhelming when it comes to formula recognition.
In a lot of scientific and engineering documents these formulas are essential, and marker (with the right settings!) rarely lets you down (only minor markup issues).
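As a rough illustration of how small that change could be, here is a Python sketch of a configurable base URL with a fallback. The environment variable name `MARKER_API_BASE_URL`, the placeholder default host, and the `/convert` path are all assumptions made up for this sketch; they are not the real Open WebUI setting or the real Marker API endpoints.

```python
import os
from typing import Mapping
from urllib.parse import urlparse

# Placeholder for the hosted provider's address (assumption, not the real URL).
DEFAULT_BASE_URL = "https://hosted-marker.example.invalid"


def resolve_base_url(env: Mapping[str, str] = os.environ) -> str:
    """Return the user-supplied base URL if set, else the hosted default."""
    raw = env.get("MARKER_API_BASE_URL", "").strip() or DEFAULT_BASE_URL
    parsed = urlparse(raw)
    # Reject values that are not absolute http(s) URLs.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"invalid base URL: {raw!r}")
    return raw.rstrip("/")


def endpoint(path: str, env: Mapping[str, str] = os.environ) -> str:
    """Join the resolved base URL with an API path, avoiding double slashes."""
    return f"{resolve_base_url(env)}/{path.lstrip('/')}"


if __name__ == "__main__":
    # A self-hosted instance overrides the default via the environment.
    print(endpoint("/convert", {"MARKER_API_BASE_URL": "http://localhost:8001/"}))
```

Passing the environment as a mapping keeps the lookup easy to test without touching the real process environment.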
@homjay commented on GitHub (Aug 30, 2025):
MinerU now supports API mode, so it's a good time to integrate it.
@schmik commented on GitHub (Sep 1, 2025):
One shall keep in mind that using YOLOv8 in commercial contexts needs licensing, though. https://www.ultralytics.com/license