[PR #22321] feat: table-aware RAG ingestion for CSV, TSV, and Excel files #42242

Open
opened 2026-04-25 14:13:20 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/22321
Author: @salim4n
Created: 3/6/2026
Status: 🔄 Open

Base: dev ← Head: feat/table-aware-rag-ingestion


📝 Commits (10+)

📊 Changes

7 files changed (+626 additions, -54 deletions)

📝 backend/open_webui/config.py (+6 -0)
📝 backend/open_webui/retrieval/loaders/main.py (+35 -5)
➕ backend/open_webui/retrieval/loaders/table.py (+213 -0)
📝 backend/open_webui/routers/retrieval.py (+70 -49)
➕ backend/open_webui/test/retrieval/__init__.py (+0 -0)
➕ backend/open_webui/test/retrieval/loaders/__init__.py (+0 -0)
➕ backend/open_webui/test/retrieval/loaders/test_table.py (+302 -0)

📄 Description

Summary

Closes discussion #22319

Replace CSVLoader and UnstructuredExcelLoader with custom table-aware loaders that preserve row integrity for better RAG retrieval on tabular data.

  • Row integrity: never splits mid-record; each chunk contains only complete rows
  • Column context: headers are repeated in every chunk so the LLM can interpret the values
  • Delimiter auto-detection: comma, semicolon, tab, pipe (covers the semicolon-delimited exports common in European locales)
  • Multi-sheet Excel: per-sheet chunking, with the sheet name stored in metadata
  • Skip text splitting: pre-chunked table docs bypass the text splitter
  • Configurable: TABLE_ROWS_PER_CHUNK env var / config (default: 1)
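
To make the row-integrity and column-context points concrete, here is a minimal sketch of the idea using only the standard library. It is not the code in table.py: the function names `detect_delimiter` and `chunk_rows`, the header-counting heuristic, the ` | ` rendering, and the `row_start`/`row_end` metadata keys are all assumptions made for illustration. Only `file_type="table"` and the rows-per-chunk default of 1 come from the description above.

```python
import csv
import io

# Candidate delimiters named in the PR description.
CANDIDATE_DELIMITERS = [",", ";", "\t", "|"]


def detect_delimiter(text: str) -> str:
    """Hypothetical heuristic: pick the candidate that occurs most often
    in the first line. Ties (including an empty file) fall back to comma,
    because max() returns the first maximal item."""
    header = text.splitlines()[0] if text else ""
    return max(CANDIDATE_DELIMITERS, key=header.count)


def chunk_rows(text: str, rows_per_chunk: int = 1) -> list[dict]:
    """Group complete records into chunks and repeat the header in each one."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")

    reader = csv.reader(io.StringIO(text), delimiter=detect_delimiter(text))
    rows = [row for row in reader if row]  # skip blank lines
    if not rows:
        return []

    header, records = rows[0], rows[1:]
    chunks = []
    for start in range(0, len(records), rows_per_chunk):
        group = records[start:start + rows_per_chunk]
        # The header line comes first so every chunk is self-describing.
        lines = [" | ".join(header)] + [" | ".join(row) for row in group]
        chunks.append(
            {
                "page_content": "\n".join(lines),
                "metadata": {
                    "file_type": "table",
                    "row_start": start + 1,  # 1-based index of first record
                    "row_end": start + len(group),
                },
            }
        )
    return chunks
```

Calling `chunk_rows("name;city\nAda;Paris\nBob;Rome\n", rows_per_chunk=1)` would yield two chunks, each starting with the `name | city` header line. Because the slicing operates on parsed records rather than characters, a chunk boundary can never fall inside a row.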

Changed files

  • backend/open_webui/retrieval/loaders/table.py — new TableAwareCSVLoader, TableAwareExcelLoader
  • backend/open_webui/retrieval/loaders/main.py — route CSV/TSV/Excel to new loaders
  • backend/open_webui/routers/retrieval.py — bypass text splitting for file_type="table" docs
  • backend/open_webui/config.py — add TABLE_ROWS_PER_CHUNK
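
The routing and configuration changes can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in routers/retrieval.py or config.py: the helper names `rows_per_chunk_from_env` and `split_documents`, the dict-based document shape, and the fallback behaviour for invalid values are hypothetical. The `TABLE_ROWS_PER_CHUNK` name, its default of 1, and the `file_type="table"` marker are taken from the description.

```python
import os
from typing import Callable


def rows_per_chunk_from_env(default: int = 1) -> int:
    """Read TABLE_ROWS_PER_CHUNK, falling back to the default when the
    variable is unset, not an integer, or smaller than 1 (assumed policy)."""
    raw = os.environ.get("TABLE_ROWS_PER_CHUNK", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value >= 1 else default


def split_documents(
    docs: list[dict],
    split_text: Callable[[str], list[str]],
) -> list[dict]:
    """Run the text splitter on ordinary docs; pass table docs through as-is."""
    result = []
    for doc in docs:
        if doc["metadata"].get("file_type") == "table":
            # Already chunked by rows, so splitting again could cut a record.
            result.append(doc)
            continue
        for piece in split_text(doc["page_content"]):
            result.append({"page_content": piece, "metadata": dict(doc["metadata"])})
    return result
```

The point of the bypass is that a character- or token-based splitter has no notion of records, so re-splitting a row-based chunk could separate values from their headers.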

Testing

  • 24 unit tests (delimiter detection, chunking, metadata, encoding fallback, empty files, multi-sheet Excel, invalid file handling)
  • All tests passing locally (pytest)
  • Personally tested: upload CSV with semicolons, verified chunks preserve rows with headers
  • Personally tested: upload multi-sheet Excel, verified per-sheet chunking with sheet metadata
  • Personally tested: TABLE_ROWS_PER_CHUNK config respected (tested with 1 and 5)
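
The test list mentions an encoding fallback without describing it. One common way to do this is to try a short list of encodings in order, as sketched below. The function name `decode_with_fallback` and the specific encoding order are assumptions for illustration; the PR's actual fallback chain is not shown in this description.

```python
def decode_with_fallback(
    raw: bytes,
    encodings: tuple[str, ...] = ("utf-8-sig", "cp1252", "latin-1"),
) -> tuple[str, str]:
    """Return (text, encoding_used), trying each encoding in order.

    "utf-8-sig" reads UTF-8 with or without a byte-order mark. "latin-1"
    maps every byte to a character, so placing it last guarantees a result.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")
```

A fallback like this trades correctness for robustness: a file in an unlisted encoding will still load, but possibly with wrong characters, so recording the encoding that was used is helpful when debugging.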

Contributor License Agreement

I have read and agree to the Contributor License Agreement.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

Commits (10+), as listed on the original PR:

  • fe6783c Merge pull request #19030 from open-webui/dev
  • fc05e0a Merge pull request #19405 from open-webui/dev
  • e3faec6 Merge pull request #19416 from open-webui/dev
  • 9899293 Merge pull request #19448 from open-webui/dev
  • 140605e Merge pull request #19462 from open-webui/dev
  • 6f1486f Merge pull request #19466 from open-webui/dev
  • d95f533 Merge pull request #19729 from open-webui/dev
  • a727153 0.6.43 (#20093)
  • 6adde20 Merge pull request #20394 from open-webui/dev
  • f9b0534 Merge pull request #20522 from open-webui/dev

GiteaMirror added the pull-request label 2026-04-25 14:13:20 -05:00
Reference: github-starred/open-webui#42242