[GH-ISSUE #6844] enh: marker integration for better pdf parsing #53173

Closed
opened 2026-05-05 14:24:03 -05:00 by GiteaMirror · 13 comments
Owner

Originally created by @tjbck on GitHub (Nov 11, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/6844

Originally assigned to: @tjbck on GitHub.

Author
Owner

@sir3mat commented on GitHub (Nov 12, 2024):

What about docling support?


@jannikstdl commented on GitHub (Nov 13, 2024):

> What about docling support?

Seems interesting,

maybe https://github.com/drmingler/docling-api is also worth looking at.


@hongbo-miao commented on GitHub (Dec 28, 2024):

MinerU (https://github.com/opendatalab/MinerU) is quite promising.

I tried both Docling and MinerU. Unfortunately, Docling does not currently support formula recognition. However, the overall experience with Docling is quite smooth.

MinerU provides the best results and can even recognize formulas. Additionally, its table parsing and layout detection are much better.

Below is the Markdown generated by MinerU. As you can see, even though this formula is very complex, it recognized it perfectly. 🤩

[Screenshot: https://github.com/user-attachments/assets/e692a24d-4c8b-441a-b881-a440b07f1fc5]

Under the hood, it uses

  • fine-tuned YOLOv8 for formula detection
  • UniMERNet for formula recognition

The setup used to be complex, but it is easy now. I provided a guide at https://github.com/opendatalab/MinerU/discussions/1374 if you want to try MinerU ☺️


@hongbo-miao commented on GitHub (Jan 6, 2025):

There is also a recent one from Microsoft: MarkItDown (https://github.com/microsoft/markitdown). However, it currently lacks support for formulas as well: https://github.com/microsoft/markitdown/issues/17


@flefevre commented on GitHub (Jan 25, 2025):

Perhaps a good solution would be to design a generic interface.
The general settings already have a "Content Extraction Engine" parameter, which can be set to "default" or "tika".
Perhaps we could have a generic interface for plugging in different content extraction servers?
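
As a rough illustration of the generic interface suggested here, the sketch below shows one way pluggable extraction engines could be registered by name. It is a minimal sketch, not Open WebUI's actual code: `ContentExtractor`, `PlainTextExtractor`, `register_engine`, and `get_engine` are all hypothetical names introduced for this example, and the stand-in engine only decodes bytes as text.

```python
from abc import ABC, abstractmethod


class ContentExtractor(ABC):
    """Hypothetical common interface for a content extraction engine."""

    @abstractmethod
    def extract(self, data: bytes, filename: str) -> str:
        """Return the extracted text (for example Markdown) for one file."""


class PlainTextExtractor(ContentExtractor):
    """Stand-in engine for the sketch: decodes the bytes as UTF-8 text."""

    def extract(self, data: bytes, filename: str) -> str:
        return data.decode("utf-8", errors="replace")


# Registry mapping an engine name (as it might appear in settings)
# to the class that implements it. Hypothetical, for illustration only.
_ENGINES: dict[str, type[ContentExtractor]] = {}


def register_engine(name: str, engine_cls: type[ContentExtractor]) -> None:
    """Make an engine selectable under the given name."""
    _ENGINES[name] = engine_cls


def get_engine(name: str) -> ContentExtractor:
    """Instantiate the engine registered under `name`."""
    try:
        return _ENGINES[name]()
    except KeyError:
        raise ValueError(f"unknown content extraction engine: {name!r}") from None


register_engine("default", PlainTextExtractor)
```

With this shape, an engine backed by a remote server (Tika, docling, Marker, MinerU) would be another subclass that sends the file to its server inside `extract`, and the settings value would only need to match a registered name, e.g. `get_engine("default").extract(b"hello", "note.txt")`.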

Integrating an enhanced PDF extractor that can handle scientific papers would have a big impact for us, as we have several PhD students and scientists.


@flefevre commented on GitHub (Jan 30, 2025):

Some independent people in our laboratory ran tests that basically compared Open WebUI with Mistral Small against Adobe Reader's RAG.
They produced a negative report on Open WebUI, trying to convince others not to use it.
This was mainly because they did not know the architecture of Open WebUI and missed the fact that RAG in Open WebUI depends on the first module, "tika", as the OCR extraction tool.

To keep people using Open WebUI for scientific topics, we need some advice on how to configure Open WebUI to get good results.

  • Is there a difference between Chroma and Milvus?
  • Do you plan to add more documentation on this topic?
  • Do you plan to integrate another OCR tool such as docling? If so, what is your potential roadmap?

Thanks for sharing your expertise and vision.


@MichaelKarpe commented on GitHub (Feb 2, 2025):

For local experimental purposes, it takes minimal changes to use docling as a content extraction engine; see #9238. Unfortunately, I will not have time in the coming days or weeks to make it ready for a release. If you have a few hours or more, please feel free to build upon this PR for docling integration.


@oatmealm commented on GitHub (May 30, 2025):

@tjbck will it be possible to allow custom URLs for those who want to try the self-hosted version? It seems like the address of the provider's endpoint is hard-coded.


@Mte90 commented on GitHub (Jun 27, 2025):

I made this one as an external tool: https://github.com/CodeAtCode/deadsimple. It can probably be improved to use other tools instead of markitdown.


@vojtapolasek commented on GitHub (Jul 9, 2025):

It seems that there is an option to use the hosted Marker API. However, I would like to be able to change the URL so that I can use a self-hosted Marker API. Please consider this.


@HenkieTenkie62 commented on GitHub (Jul 16, 2025):

> It seems that there is an option to use hosted Marker API. However, I would like to be able to change the URL so that I can use selfhosted marker API. Please consider this.

This would be a great option, and I don't think a lot of code would be needed to change the address for API calls. It would be a great addition, as I've found the standard text extraction, and even Docling with EasyOCR and RapidOCR, to be very underwhelming when it comes to formula recognition.
In a lot of scientific and engineering documents these formulas are essential, and marker (with the right settings!) rarely lets you down (only minor markup issues).
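
To make the request concrete, here is a minimal sketch of how the address used for API calls could be made configurable instead of hard-coded. Everything in it is an assumption for illustration: the environment variable `MARKER_API_BASE_URL`, the placeholder default `https://marker.example.com/api/`, the `convert` path, and the helper `resolve_endpoint` are hypothetical and are not taken from Open WebUI or from the Marker API.

```python
import os
from urllib.parse import urljoin

# Placeholder default for the sketch; NOT the real hosted Marker endpoint.
DEFAULT_BASE_URL = "https://marker.example.com/api/"


def resolve_endpoint(path: str, env=None) -> str:
    """Build the request URL from a configurable base address.

    `env` defaults to the process environment; passing a dict makes the
    function easy to exercise without touching real settings.
    """
    env = os.environ if env is None else env
    # Hypothetical variable name; fall back to the default when unset or empty.
    base = env.get("MARKER_API_BASE_URL", "").strip() or DEFAULT_BASE_URL
    # urljoin drops the last path segment unless the base ends with a slash.
    if not base.endswith("/"):
        base += "/"
    return urljoin(base, path.lstrip("/"))
```

Under these assumptions, setting `MARKER_API_BASE_URL=http://localhost:8001` would send requests to a self-hosted instance, while leaving it unset keeps the default address.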


@homjay commented on GitHub (Aug 30, 2025):

> MinerU (https://github.com/opendatalab/MinerU) is quite promising.
>
> I tried both Docling and MinerU. Unfortunately, Docling does not currently support formula recognition. However, the overall experience with Docling is quite smooth.
>
> MinerU provides the best results and can even recognize formulas. Additionally, its table parsing and layout detection are much better.
>
> Below is a markdown generated by MinerU. As you can see, even though this formula is very complex, it recognized it perfectly. 🤩
>
> [Screenshot: https://github.com/user-attachments/assets/e692a24d-4c8b-441a-b881-a440b07f1fc5]
>
> Under the hood, it uses
>
>   • fine-tuned YOLOv8 for formula detection
>   • UniMERNet for formula recognition
>
> The setup used to be complex, but it is easy now. I provided a guide at opendatalab/MinerU#1374 if you want to try MinerU ☺️

MinerU now supports API mode, so it's a good time to integrate it.


@schmik commented on GitHub (Sep 1, 2025):

One should keep in mind, though, that using YOLOv8 in commercial contexts requires licensing. https://www.ultralytics.com/license


Reference: github-starred/open-webui#53173