[PR #22320] [CLOSED] feat: table-aware RAG ingestion for CSV, TSV, and Excel files #26611

Closed
opened 2026-04-20 06:36:11 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/22320
Author: @salim4n
Created: 3/6/2026
Status: Closed

Base: main ← Head: feat/table-aware-rag-ingestion


📝 Commits (3)

  • 356aa37 feat: table-aware RAG ingestion for CSV, TSV, and Excel files
  • b720726 test: add unit tests for table-aware CSV and Excel loaders
  • 3d72a8d refactor: change TABLE_ROWS_PER_CHUNK default from 5 to 1

📊 Changes

7 files changed (+626 additions, -54 deletions)

View changed files

📝 backend/open_webui/config.py (+6 -0)
📝 backend/open_webui/retrieval/loaders/main.py (+35 -5)
backend/open_webui/retrieval/loaders/table.py (+213 -0)
📝 backend/open_webui/routers/retrieval.py (+70 -49)
backend/open_webui/test/retrieval/__init__.py (+0 -0)
backend/open_webui/test/retrieval/loaders/__init__.py (+0 -0)
backend/open_webui/test/retrieval/loaders/test_table.py (+302 -0)

📄 Description

Summary

Closes discussion #22319

Replace CSVLoader and UnstructuredExcelLoader with custom table-aware loaders that preserve row integrity for better RAG retrieval on tabular data.

  • Row integrity: never splits mid-record, each chunk contains complete rows
  • Column context: headers repeated in every chunk so the LLM can interpret values
  • Delimiter auto-detection: comma, semicolon, tab, pipe (handles EU exports)
  • Multi-sheet Excel: per-sheet chunking with sheet name metadata
  • Skip text splitting: pre-chunked table docs bypass the text splitter
  • Configurable: TABLE_ROWS_PER_CHUNK env var / config (default: 1)
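
The row-integrity, column-context, and delimiter-detection points above can be illustrated with a minimal sketch. This is not the PR's `table.py`; the function names `detect_delimiter` and `chunk_rows`, the 4096-character sample size, and the `" | "` output layout are assumptions chosen for illustration. It uses only the standard library `csv` module.

```python
import csv
import io


def detect_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the delimiter among the candidates; fall back to a comma."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Sniffing can fail on single-column or very irregular data.
        return ","


def chunk_rows(text: str, rows_per_chunk: int = 1) -> list[str]:
    """Split CSV text into chunks of whole rows, repeating the header in each."""
    delimiter = detect_delimiter(text[:4096])  # sample size is an assumption
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    if len(rows) < 2:
        return []  # header only, or empty input: nothing to index
    header, records = rows[0], rows[1:]
    chunks = []
    for start in range(0, len(records), rows_per_chunk):
        # Every chunk starts with the header so values stay interpretable.
        lines = [" | ".join(header)]
        lines += [" | ".join(rec) for rec in records[start:start + rows_per_chunk]]
        chunks.append("\n".join(lines))
    return chunks
```

With the semicolon-separated input `"name;city\nAda;Paris\nBob;Rome\nCy;Oslo\n"` and `rows_per_chunk=2`, this yields two chunks, each beginning with `name | city`, and no record is cut in half.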

Changed files

  • backend/open_webui/retrieval/loaders/table.py — new TableAwareCSVLoader, TableAwareExcelLoader
  • backend/open_webui/retrieval/loaders/main.py — route CSV/TSV/Excel to new loaders
  • backend/open_webui/routers/retrieval.py — bypass text splitting for file_type="table" docs
  • backend/open_webui/config.py — add TABLE_ROWS_PER_CHUNK
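
The bypass described for `routers/retrieval.py` might look roughly like the sketch below. It is an assumption-laden illustration, not the PR's code: `Doc` is a stand-in for whatever document class the loaders return, `naive_split` is a placeholder for the real text splitter, and storing `file_type` in the document metadata is a guess based on the `file_type="table"` wording above.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    # Hypothetical stand-in for a loader document.
    page_content: str
    metadata: dict = field(default_factory=dict)


def naive_split(doc: Doc, size: int = 20) -> list[Doc]:
    """Placeholder text splitter: fixed-width character windows."""
    text = doc.page_content
    return [Doc(text[i:i + size], dict(doc.metadata)) for i in range(0, len(text), size)]


def prepare_chunks(docs: list[Doc]) -> list[Doc]:
    """Pass pre-chunked table docs through untouched; split everything else."""
    out: list[Doc] = []
    for doc in docs:
        if doc.metadata.get("file_type") == "table":
            out.append(doc)  # already one chunk of complete rows
        else:
            out.extend(naive_split(doc))
    return out
```

Keeping the check on a metadata flag rather than on the file extension means the router does not need to know which loader produced a document, only whether it arrived pre-chunked.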

Test plan

  • [x] 24 unit tests (delimiter detection, chunking, metadata, encoding fallback, empty files, multi-sheet Excel, invalid file handling)
  • [x] All tests passing locally (pytest)
  • [ ] Manual test: upload CSV with semicolons → verify chunks preserve rows
  • [ ] Manual test: upload multi-sheet Excel → verify per-sheet chunking
  • [ ] Manual test: verify TABLE_ROWS_PER_CHUNK config is respected
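
For the last item, one way to check that `TABLE_ROWS_PER_CHUNK` is respected is to isolate how the value is read. The helper below is a hypothetical sketch: the name `rows_per_chunk_from_env` and the fallback behaviour for invalid values are assumptions, and the project's actual config plumbing in `config.py` may differ.

```python
import os


def rows_per_chunk_from_env(default: int = 1) -> int:
    """Read TABLE_ROWS_PER_CHUNK, falling back to the default on bad input."""
    raw = os.environ.get("TABLE_ROWS_PER_CHUNK", "")
    try:
        value = int(raw)
    except ValueError:
        # Unset, empty, or non-numeric values use the default.
        return default
    # A chunk needs at least one row; reject zero and negatives.
    return value if value >= 1 else default
```

Setting the variable to `5` before calling the helper should return `5`, while leaving it unset should return the default of `1` stated in the summary.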

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-20 06:36:11 -05:00
Reference: github-starred/open-webui#26611