[PR #10430] [CLOSED] Feat: Adding Support for Azure AI Document Intelligence for Content Extraction #9328

Closed
opened 2025-11-11 18:19:59 -06:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/10430
Author: @Micca
Created: 2/20/2025
Status: Closed

Base: dev ← Head: feature/document_intelligence_support


📝 Commits (2)

  • 35f3824 feat: Implement Document Intelligence as Content Extraction Engine
  • f8183e3 i18n: Update translations for Document Intelligence Update

📊 Changes

57 files changed (+296 additions, -44 deletions)

View changed files

📝 backend/open_webui/config.py (+12 -0)
📝 backend/open_webui/main.py (+4 -0)
📝 backend/open_webui/retrieval/loaders/main.py (+22 -0)
📝 backend/open_webui/routers/retrieval.py (+26 -1)
📝 backend/requirements.txt (+1 -0)
📝 pyproject.toml (+1 -0)
📝 src/lib/apis/retrieval/index.ts (+6 -0)
📝 src/lib/components/admin/Settings/Documents.svelte (+35 -1)
📝 src/lib/i18n/locales/ar-BH/translation.json (+3 -0)
📝 src/lib/i18n/locales/bg-BG/translation.json (+3 -0)
📝 src/lib/i18n/locales/bn-BD/translation.json (+3 -0)
📝 src/lib/i18n/locales/ca-ES/translation.json (+3 -0)
📝 src/lib/i18n/locales/ceb-PH/translation.json (+3 -0)
📝 src/lib/i18n/locales/cs-CZ/translation.json (+3 -0)
📝 src/lib/i18n/locales/da-DK/translation.json (+3 -0)
📝 src/lib/i18n/locales/de-DE/translation.json (+3 -0)
📝 src/lib/i18n/locales/el-GR/translation.json (+3 -0)
📝 src/lib/i18n/locales/en-GB/translation.json (+3 -0)
📝 src/lib/i18n/locales/en-US/translation.json (+3 -0)
📝 src/lib/i18n/locales/es-ES/translation.json (+3 -0)

...and 37 more files

📄 Description

Pull Request Checklist

Before submitting, make sure you've checked the following:

  • [x] Target branch: Please verify that the pull request targets the dev branch.
  • [x] Description: Provide a concise description of the changes made in this pull request.
  • [x] Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
  • [ ] Documentation: Have you updated the relevant documentation (Open WebUI Docs or other documentation sources)?
  • [ ] Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
  • [x] Testing: Have you written and run sufficient tests for validating the changes?
  • [x] Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • [x] Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:
    • BREAKING CHANGE: Significant changes that may affect compatibility
    • build: Changes that affect the build system or external dependencies
    • ci: Changes to our continuous integration processes or workflows
    • chore: Refactor, cleanup, or other non-functional code changes
    • docs: Documentation update or addition
    • feat: Introduces a new feature or enhancement to the codebase
    • fix: Bug fix or error correction
    • i18n: Internationalization or localization changes
    • perf: Performance improvement
    • refactor: Code restructuring for better maintainability, readability, or scalability
    • style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc.)
    • test: Adding missing tests or correcting existing tests
    • WIP: Work in progress, a temporary label for incomplete or ongoing work

Changelog Entry

Description

Open WebUI's standard content extraction engine cannot handle image-based PDFs. This PR implements Azure AI Document Intelligence as an additional content extraction engine, giving users who need structured, OCR-parsed documents another option. The Document Intelligence engine connects Open WebUI to a Microsoft service that extracts structured content from PDF, XLS, DOCX, and PPT files.
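As a rough illustration of the extraction step described above, the sketch below sends a file to the service and returns the extracted text. It is not the PR's loader code. It assumes the `DocumentIntelligenceClient` from `azure-ai-documentintelligence` 1.0.0 and the `prebuilt-layout` model, and the names `DESCRIBED_EXTENSIONS`, `is_described_type`, and `extract_text` are hypothetical.

```python
import os

# File types named in the PR description; the actual loader's list may differ.
DESCRIBED_EXTENSIONS = {"pdf", "xls", "xlsx", "docx", "ppt"}


def is_described_type(filename: str) -> bool:
    """Return True if the file extension is one the PR description mentions."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return ext in DESCRIBED_EXTENSIONS


def extract_text(path: str, endpoint: str, key: str) -> str:
    """Send a local file to Azure AI Document Intelligence and return its text.

    Assumption: the "prebuilt-layout" model is used; the PR may use another.
    """
    # Imported lazily so this module loads even without the Azure SDK installed.
    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentIntelligenceClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )
    with open(path, "rb") as f:
        # Starts a long-running analysis; result() blocks until it finishes.
        poller = client.begin_analyze_document("prebuilt-layout", f)
        result = poller.result()
    return result.content
```

Calling `extract_text` requires a real Azure resource and key, so only `is_described_type` can be exercised offline.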

Added

  • Added environment variables:

    • DOCUMENT_INTELLIGENCE_ENDPOINT (can be None or https://<your-resource-name>.cognitiveservices.azure.com)
    • DOCUMENT_INTELLIGENCE_KEY (can be None or 32-character alphanumeric key)
  • Added dependency: azure-ai-documentintelligence==1.0.0

  • Implemented Azure AI Document Intelligence as content extraction engine for docx, pdf, ppt and xlsx

  • Added ability to configure Document Intelligence as content extraction engine, with the related environment variables being saved in PersistentConfig

  • Added localization for newly introduced fields
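The two environment variables listed above could be read along the following lines. This is a minimal sketch using only the standard library. It does not use Open WebUI's PersistentConfig, and `DocumentIntelligenceSettings` and `load_settings` are hypothetical names introduced for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DocumentIntelligenceSettings:
    """Holds the two values the PR introduces; either may be None."""

    endpoint: Optional[str]
    key: Optional[str]

    @property
    def is_configured(self) -> bool:
        # The engine is only usable when both values are present.
        return bool(self.endpoint) and bool(self.key)


def load_settings(env: Mapping[str, str] = os.environ) -> DocumentIntelligenceSettings:
    """Read the settings from a mapping, treating empty strings as unset."""
    endpoint = env.get("DOCUMENT_INTELLIGENCE_ENDPOINT") or None
    key = env.get("DOCUMENT_INTELLIGENCE_KEY") or None
    return DocumentIntelligenceSettings(endpoint=endpoint, key=key)
```

Passing the environment as a mapping keeps the function easy to test. A caller might check `load_settings().is_configured` before offering Document Intelligence as an engine choice.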

Changed

  • None

Deprecated

  • None

Removed

  • None

Fixed

  • None

Security

  • None

Breaking Changes

  • None

Additional Information

  • Additional information about the changes can be found in #9583.

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2025-11-11 18:19:59 -06:00
Reference: github-starred/open-webui#9328