[PR #7830] [CLOSED] feat: Batch Processing for Large-Scale Document Import #60972

opened 2026-05-06 04:09:11 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/7830
Author: @gabriel-ecegi
Created: 12/13/2024
Status: Closed

Base: dev ← Head: dev


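📝 Commits (2)

  • f2e2b59 Add batching (https://github.com/open-webui/open-webui/commit/f2e2b59c181a669a113dcf7f646aafc13defbc44)
  • 440894f Fix process/files/batch (https://github.com/open-webui/open-webui/commit/440894f8d3ead0ef438e89cbb5a1b46ff3cd58af)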

📊 Changes

2 files changed (+182 additions, -14 deletions)


📝 backend/open_webui/apps/retrieval/main.py (+97 -10)
📝 backend/open_webui/apps/webui/routers/knowledge.py (+85 -4)

📄 Description

Pull Request: Add Batch Processing Support for Document Import

Discussion: https://github.com/open-webui/open-webui/discussions/7829

Before submitting, make sure you've checked the following:

  • Target branch: Pull request targets the dev branch
  • Description: Added batch processing capability for efficient handling of large document imports
  • Changelog: Added below
  • Documentation: No documentation updates needed
  • Dependencies: No new dependencies added
  • Testing: Tested with large document sets
  • Code review: Performed self-review
  • Prefix: feat: Add batch processing for document imports

Changelog Entry

Description

Added batch processing capability to significantly improve performance when importing large volumes of documents. This enhancement is particularly valuable for enterprise integrations (like Confluence imports) where thousands of documents need to be processed simultaneously.

Added

  • New models for batch processing:
    • BatchProcessFilesForm
    • BatchProcessFilesResult
    • BatchProcessFilesResponse
  • New endpoint /process/files/batch for batch document processing
  • Batch file processing support in the knowledge base routes
  • Warning system that reports partial successes in batch operations
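
For readers who want a concrete picture of the three models, here is a minimal sketch of how they could be shaped. Only the class names come from this PR; every field name, the status strings, and the `build_response` helper are assumptions made for illustration, and plain dataclasses are used so the sketch runs without dependencies.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BatchProcessFilesForm:
    # Hypothetical request body: which files to process and where they go.
    file_ids: List[str]
    collection_name: str


@dataclass
class BatchProcessFilesResult:
    # Hypothetical per-file outcome.
    file_id: str
    status: str  # assumed values: "completed" or "failed"
    error: Optional[str] = None


@dataclass
class BatchProcessFilesResponse:
    # Hypothetical response: successes and failures are reported separately.
    results: List[BatchProcessFilesResult] = field(default_factory=list)
    errors: List[BatchProcessFilesResult] = field(default_factory=list)


def build_response(outcomes: List[BatchProcessFilesResult]) -> BatchProcessFilesResponse:
    """Split per-file outcomes into the success list and the failure list."""
    response = BatchProcessFilesResponse()
    for outcome in outcomes:
        if outcome.status == "completed":
            response.results.append(outcome)
        else:
            response.errors.append(outcome)
    return response
```

Keeping failures in their own list lets a caller check `response.errors` once instead of scanning every result.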

Changed

  • Modified knowledge base file addition to support batch operations
  • Improved error handling to allow partial success scenarios
  • Optimized vector DB operations by processing documents in batches

Fixed

  • Resolved a performance bottleneck in file processing
  • Enhanced error reporting for failed imports while allowing successful ones to proceed

Performance

  • Significantly reduced processing time for large document sets by minimizing DB operations
  • Optimized memory usage through batch processing
  • Reduced API calls by processing multiple files in a single request
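
The effect of batching on the number of DB operations can be illustrated with a small sketch. This is not the code from the PR: `InMemoryVectorDB`, `chunked`, `insert_batched`, and the default batch size of 100 are hypothetical names and values chosen only to show the idea of one insert call per batch instead of one per document.

```python
from typing import Dict, Iterator, List, Sequence


def chunked(items: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


class InMemoryVectorDB:
    """Hypothetical stand-in for a vector DB client; it counts insert calls."""

    def __init__(self) -> None:
        self.collections: Dict[str, List[dict]] = {}
        self.insert_calls = 0

    def insert(self, collection_name: str, items: Sequence[dict]) -> None:
        self.insert_calls += 1
        self.collections.setdefault(collection_name, []).extend(items)


def insert_batched(db: InMemoryVectorDB, collection_name: str,
                   docs: Sequence[dict], batch_size: int = 100) -> int:
    """Insert `docs` in batches and return how many insert calls were made."""
    calls = 0
    for batch in chunked(docs, batch_size):
        db.insert(collection_name, batch)
        calls += 1
    return calls


docs = [{"id": str(i), "text": f"document {i}"} for i in range(250)]
db = InMemoryVectorDB()
calls = insert_batched(db, "knowledge-base", docs, batch_size=100)
```

With 250 documents and a batch size of 100, the sketch issues 3 insert calls where a one-by-one loop would issue 250.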

Additional Information

This enhancement addresses the performance bottleneck when importing large document sets. Instead of processing files one by one, we can now handle hundreds of files in a single operation, making integrations with enterprise systems more practical.

The implementation's error handling allows partial success: if some files fail to process, the successful ones are still added to the knowledge base, and each failure is reported back to the caller.
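
The partial-success behaviour described here can be sketched as follows. The function names, the shape of the summary dictionary, and the `warnings` key are assumptions for illustration, not the PR's actual code; the point is only that one failing file does not abort the rest of the batch.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def process_batch(file_ids: Iterable[str],
                  process_one: Callable[[str], None]) -> Tuple[List[str], Dict[str, str]]:
    """Process every file, collecting failures instead of stopping at the first one."""
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    for file_id in file_ids:
        try:
            process_one(file_id)
        except Exception as exc:  # keep going; record why this file failed
            failed[file_id] = str(exc)
        else:
            succeeded.append(file_id)
    return succeeded, failed


def summarize(succeeded: List[str], failed: Dict[str, str]) -> dict:
    """Build a response that lists added files and, if needed, a warnings block."""
    summary: dict = {"added": succeeded}
    if failed:
        total = len(succeeded) + len(failed)
        summary["warnings"] = {
            "message": f"{len(failed)} of {total} files failed to process",
            "errors": failed,
        }
    return summary


def fake_processor(file_id: str) -> None:
    # Hypothetical processor: rejects one file to simulate a failure.
    if file_id == "corrupt.pdf":
        raise ValueError("could not extract text")


ok, bad = process_batch(["a.md", "corrupt.pdf", "b.md"], fake_processor)
report = summarize(ok, bad)
```

Here `report["added"]` holds the two files that went through, and `report["warnings"]` carries the failure reason for `corrupt.pdf`.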


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-06 04:09:12 -05:00
Reference: github-starred/open-webui#60972