[GH-ISSUE #22319] feat: table-aware RAG ingestion for CSV, TSV, and Excel files #58363

GiteaMirror · 2026-05-05T23:02:44-05:00

Originally created by @salim4n on GitHub (Mar 6, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/22319

Check Existing Issues

I have searched for all existing open AND closed issues and discussions for similar requests. I have found none that is comparable to my request.

Verify Feature Scope

I have read through and understood the scope definition for feature requests in the Issues section. I believe my feature request meets the definition and belongs in the Issues section instead of the Discussions.

Problem Description

When uploading CSV, TSV, or Excel files, the current RAG pipeline uses CSVLoader and UnstructuredExcelLoader, which split tabular data as if it were plain text. This causes:

Row integrity loss: rows get split across chunks mid-record
Missing column context: chunks don't include headers, so the LLM can't interpret values
Poor delimiter handling: CSVLoader assumes commas and fails on the semicolons and tabs common in EU exports
No multi-sheet support: Excel workbooks lose sheet-level structure

Desired Solution you'd like

Replace the default loaders with table-aware alternatives that:

Preserve complete rows (never split mid-record)
Repeat column headers in every chunk for LLM context
Auto-detect delimiters (comma, semicolon, tab, pipe)
Handle multi-sheet Excel with per-sheet metadata
Skip text splitting for pre-chunked tabular documents

Chunk size is configurable via the TABLE_ROWS_PER_CHUNK env var (default: 1 row per chunk, for precise retrieval).
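
As a rough illustration of the behaviour requested above, the sketch below chunks delimited text by whole rows and repeats the header in every chunk. It is not the implementation mentioned in this issue: `detect_delimiter` and `chunk_table` are hypothetical names, delimiter detection leans on the standard library's `csv.Sniffer`, and the `Columns:` / `Row N:` layout is borrowed from the example in this issue.

```python
import csv
import io


def detect_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the delimiter with csv.Sniffer; fall back to a comma (assumption)."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return ","


def chunk_table(text: str, rows_per_chunk: int = 1) -> list[str]:
    """Return chunks of whole rows, each prefixed with the column header."""
    delimiter = detect_delimiter(text[:4096])
    # Drop completely empty lines; csv.reader keeps quoted fields intact.
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    header_line = "Columns: " + " | ".join(header)
    chunks = []
    for start in range(0, len(body), rows_per_chunk):
        lines = [header_line]
        for offset, row in enumerate(body[start:start + rows_per_chunk]):
            lines.append(f"Row {start + offset}: " + " | ".join(row))
        chunks.append("\n".join(lines))
    return chunks


if __name__ == "__main__":
    sample = "prenom;nom;ville;age\nJean;Dupont;Paris;30\nMarie;Martin;Lyon;25\n"
    for chunk in chunk_table(sample):
        print(chunk)
        print("---")
```

With one row per chunk, every record is retrievable on its own and still carries its column names. Raising `rows_per_chunk` trades that precision for fewer, larger chunks.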

Scope

backend/open_webui/retrieval/loaders/table.py — new TableAwareCSVLoader, TableAwareExcelLoader
backend/open_webui/retrieval/loaders/main.py — routing CSV/TSV/Excel to new loaders
backend/open_webui/routers/retrieval.py — bypass text splitting for table docs
backend/open_webui/config.py — new TABLE_ROWS_PER_CHUNK config entry
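
To make the routing and configuration pieces concrete, here is a minimal sketch under stated assumptions. The extension list, the helper names `pick_table_loader` and `rows_per_chunk_from_env`, and the plain `os.environ` lookup are all hypothetical; Open WebUI's actual config and loader-selection code may be structured quite differently.

```python
import os
from pathlib import Path
from typing import Optional

# Hypothetical mapping from extension to the loader names proposed in this issue.
TABLE_EXTENSIONS = {
    ".csv": "TableAwareCSVLoader",
    ".tsv": "TableAwareCSVLoader",
    ".xls": "TableAwareExcelLoader",
    ".xlsx": "TableAwareExcelLoader",
}


def rows_per_chunk_from_env(default: int = 1) -> int:
    """Read TABLE_ROWS_PER_CHUNK, falling back to the default on bad input."""
    raw = os.environ.get("TABLE_ROWS_PER_CHUNK", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value >= 1 else default


def pick_table_loader(filename: str) -> Optional[str]:
    """Return the table-aware loader name for a file, or None for other types."""
    return TABLE_EXTENSIONS.get(Path(filename).suffix.lower())
```

Returning `None` for non-tabular files lets the caller keep its existing loader and text-splitting path for everything else.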

Example

Before (CSVLoader):
Jean;Dupont;Par
is;30
Marie;Martin;Ly
on;25

After (TableAwareCSVLoader):
Columns: prenom | nom | ville | age
Row 0: Jean | Dupont | Paris | 30
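
The "Before" fragments are what fixed-width character splitting produces when it ignores record boundaries. The sketch below reproduces them with a 15-character window. That window size is an assumption chosen only to match the fragments shown here, not the pipeline's real chunk size, and `split_fixed` is a hypothetical helper rather than part of any loader.

```python
def split_fixed(text: str, size: int) -> list[str]:
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


if __name__ == "__main__":
    # Records obtained by joining the fragments from the example.
    for record in ("Jean;Dupont;Paris;30", "Marie;Martin;Lyon;25"):
        print(split_fixed(record, 15))
    # ['Jean;Dupont;Par', 'is;30']
    # ['Marie;Martin;Ly', 'on;25']
```

Neither fragment of a record is useful in isolation: the city name is cut in half, and the age lands in a chunk with no name attached.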

Tests

24 unit tests covering: delimiter detection (comma, semicolon, tab, pipe), chunking, metadata, encoding fallback (latin-1), empty files, multi-sheet Excel, invalid file handling.
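
The encoding fallback mentioned above could look roughly like the following. This is a sketch assuming UTF-8 is tried first and latin-1 second; `decode_with_fallback` is a hypothetical name, and the order used in the actual implementation is not stated in this issue.

```python
def decode_with_fallback(
    data: bytes, encodings: tuple[str, ...] = ("utf-8", "latin-1")
) -> tuple[str, str]:
    """Decode bytes with the first encoding that works; return (text, encoding)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


if __name__ == "__main__":
    legacy = "prénom;nom\nRené;Dupont\n".encode("latin-1")
    text, used = decode_with_fallback(legacy)
    print(used, text.splitlines()[0])
```

Because latin-1 maps every byte to a character, it never raises and works as a last resort. The cost is that text in some other legacy encoding would be decoded without error but with the wrong characters.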

I have a working implementation ready to submit as a PR if this is welcome.

Alternatives Considered

No response

Additional Context

No response

GiteaMirror closed this issue

2026-05-05 23:02:44 -05:00

Reference: github-starred/open-webui#58363