mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-06 19:08:59 -05:00
[PR #14311] [MERGED] feat: Marker api content extraction support #39070
📋 Pull Request Information
Original PR: https://github.com/open-webui/open-webui/pull/14311
Author: @Hisma
Created: 5/25/2025
Status: ✅ Merged
Merged: 5/28/2025
Merged by: @tjbck
Base: dev ← Head: marker-api-content-extraction
📝 Commits (5)
9faa4c6 Merge pull request #14194 from open-webui/dev
b8e1621 Merge pull request #14364 from open-webui/dev
a9405cc feat: Marker api content extraction support
e12a79c fix: handle json output format correctly
19bb358 fix: add Datalab Marker API to Content Extraction Engine Dropdown
📊 Changes
6 files changed (+517 additions, -1 deletions)
📝 backend/open_webui/config.py (+54 -0)
📝 backend/open_webui/main.py (+18 -0)
➕ backend/open_webui/retrieval/loaders/datalab_marker_loader.py (+200 -0)
📝 backend/open_webui/retrieval/loaders/main.py (+18 -1)
📝 backend/open_webui/routers/retrieval.py (+81 -0)
📝 src/lib/components/admin/Settings/Documents.svelte (+146 -0)
📄 Description
Pull Request Checklist
Note to first-time contributors: Please open a discussion post in Discussions and describe your changes before submitting a pull request.
Before submitting, make sure you've checked the following:
Target branch: Please verify that the pull request targets the dev branch.
Description: Provide a concise description of the changes made in this pull request.
Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
Documentation: Have you updated relevant documentation Open WebUI Docs, or other documentation sources? - Detailed documentation is attached.
Datalab_Marker_API_Quick_Reference.md
Datalab_Marker_API_User_Guide.md
Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation? - no
Testing: Have you written and run sufficient tests to validate the changes? - tested in dev container - docker.io/hisma/openwebui:dev
Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards? - yes
Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:
Changelog Entry
Description
Marker can optionally use Google Gemini Flash to perform PDF OCR, enabled via a "Use LLM" toggle. The user can also select multiple OCR languages via a multi-select window and choose the document output format (markdown, json, or html). Other optional features can be toggled on or off, such as force_ocr, paginate, strip_existing_ocr, disable_image_extraction, and skip_cache, giving the user a lot of flexibility over content extraction.
Marker repo is here - https://github.com/VikParuchuri/marker
This feature uses Marker's official hosted API to access the Marker OCR engine:
https://www.datalab.to/
This add-on specifically implements the Marker API -
https://www.datalab.to/app/docs#marker
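As a rough illustration of how the options described above might be collected before being sent to the hosted API, here is a minimal sketch. It is not the code in datalab_marker_loader.py: the names MarkerOptions and build_form_fields are hypothetical, use_llm is an assumed key name for the "Use LLM" toggle, and the encoding (lowercase "true"/"false" strings, comma-joined languages) is an assumption that should be checked against the Datalab docs linked above. No network request is made.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Output formats named in the PR description.
ALLOWED_OUTPUT_FORMATS = ("markdown", "json", "html")

# Boolean toggles; all but use_llm are named in the PR description.
BOOLEAN_FLAGS = (
    "use_llm",  # assumed key name for the "Use LLM" toggle
    "force_ocr",
    "paginate",
    "strip_existing_ocr",
    "disable_image_extraction",
    "skip_cache",
)


@dataclass
class MarkerOptions:
    """Hypothetical container for the admin-panel settings."""

    langs: List[str] = field(default_factory=list)
    output_format: str = "markdown"
    use_llm: bool = False
    force_ocr: bool = False
    paginate: bool = False
    strip_existing_ocr: bool = False
    disable_image_extraction: bool = False
    skip_cache: bool = False


def build_form_fields(opts: MarkerOptions) -> Dict[str, str]:
    """Flatten the options into string form fields (assumed encoding)."""
    if opts.output_format not in ALLOWED_OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {opts.output_format!r}")

    fields = {"output_format": opts.output_format}
    for flag in BOOLEAN_FLAGS:
        # Assumption: booleans are sent as lowercase strings.
        fields[flag] = "true" if getattr(opts, flag) else "false"
    if opts.langs:
        # Assumption: multiple OCR languages are sent comma-separated.
        fields["langs"] = ",".join(opts.langs)
    return fields


if __name__ == "__main__":
    example = MarkerOptions(langs=["en", "fr"], output_format="json", use_llm=True)
    print(build_form_fields(example))
```

Validating output_format up front keeps an unsupported value from reaching the remote service, and omitting langs when no language is selected leaves the choice to the service's default behavior.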
Added
Changed
backend/open_webui/config.py
backend/open_webui/main.py
backend/open_webui/retrieval/loaders/datalab_marker_loader.py
backend/open_webui/retrieval/loaders/main.py
backend/open_webui/routers/retrieval.py
src/lib/components/admin/Settings/Documents.svelte
Deprecated
Removed
Fixed
Security
Breaking Changes
Additional Information
Screenshots or Videos
Contributor License Agreement
By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.