[GH-ISSUE #13281] feat: Integrate a Tabular-Data Analyzer (Excel/CSV → SQL pipeline) into Open WebUI RAG & Tools #16875
Originally created by @openlabollioules on GitHub (Apr 28, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/13281
Problem Description
Open WebUI already offers a Retrieval-Augmented Generation (RAG) pipeline for unstructured documents and web pages, and its roadmap explicitly states that this RAG layer will become “highly modular and extensively configurable” to address current limitations.
However, when the user uploads an Excel or CSV file, the existing pipeline tokenises the sheet as raw text and relies on semantic similarity between long cell strings. This approach often breaks on tabular data: numeric context is lost, column semantics disappear, and the model hallucinates or returns empty hits, as practitioners have reported in blogs and tutorials.
Desired Solution
Embed a Data-Analyzer tool that converts each incoming spreadsheet into an in-memory SQL (e.g., SQLite/DuckDB) table and exposes it to the LLM agent layer:
Parse Excel/CSV via pandas.read_excel/read_csv, infer types, create normalised tables.
Store a lightweight .db file in the existing /backend/data/documents volume so that persistence and multi-user RBAC keep working.
On every user question, the agent drafts a SQL query against the table, executes it, then feeds the result (plus a short JSON schema) back to the LLM prompt.
This mirrors the “Data Interpreter” pattern proposed in a recent arXiv paper (94.9% accuracy on DABench) and demonstrated in practical agent examples in the MetaGPT docs.
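The ingest-and-query flow described above can be sketched in a few lines. This is a minimal illustration under assumed names (`ingest_spreadsheet`, `run_agent_query`, the `data` table), not the actual Open WebUI implementation:

```python
import sqlite3
import pandas as pd


def ingest_spreadsheet(path: str, db_path: str = ":memory:") -> sqlite3.Connection:
    """Parse a CSV/Excel file and load it into a SQLite table.

    pandas infers the column dtypes; to_sql creates the normalised table.
    """
    df = pd.read_csv(path) if path.endswith(".csv") else pd.read_excel(path)
    conn = sqlite3.connect(db_path)
    df.to_sql("data", conn, if_exists="replace", index=False)
    return conn


def run_agent_query(conn: sqlite3.Connection, sql: str) -> list[tuple]:
    """Execute the SQL the LLM agent drafted and return the result rows.

    Only these (usually small) rows, plus a short schema description,
    are fed back into the LLM prompt.
    """
    return conn.execute(sql).fetchall()
```

In the full design, the `.db` file would be written under `/backend/data/documents` instead of `:memory:` so persistence and per-user RBAC keep working.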
As a Tool – Open WebUI’s plugin framework already allows Python-based tools that the LLM can invoke by name. A data_analyzer tool would be discoverable exactly like the existing weather or stock examples.
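As a rough sketch of what such a tool file could look like in Open WebUI's Python tool format (the `Tools` class convention is taken from the existing examples; the method name and parameters here are assumptions, not a finished implementation):

```python
import sqlite3


class Tools:
    def query_spreadsheet(self, db_path: str, sql: str) -> str:
        """
        Run a read-only SQL query against a spreadsheet that was
        converted to SQLite at upload time, returning rows as text
        so the LLM can reason over them.
        :param db_path: path of the .db file created during ingestion
        :param sql: the SELECT statement drafted by the agent
        """
        # Open read-only so a generated query cannot modify the data.
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
        try:
            rows = conn.execute(sql).fetchall()
            return "\n".join(str(row) for row in rows)
        finally:
            conn.close()
```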
As a RAG pre-processor – Detect the MIME type (csv/xlsx); if it matches, route the file through the Data Analyzer before embedding or hybrid BM25 search. That keeps the current RAG UX unchanged for users who simply drop a spreadsheet into the document inbox.
Accuracy boost on spreadsheets – Converting to relational form removes the need to embed entire rows as dense vectors and lets the model reason over explicit column names and numeric values, eliminating many hallucinations observed with vanilla RAG.
Performance – SQL filtering means the agent only embeds concise query results (often < 2 KB) into the chat context, preserving Ollama/OpenAI context tokens.
Modularity – Fits the “Tools → Functions → Pipelines” architecture so users can opt-in or disable per workspace.
Community adoption – Several Open WebUI users already run ad-hoc pandas notebooks to analyse their data; making this a first-class feature removes that friction.
I published an open-source PoC called Data-Interpreter-IA (https://github.com/openlabollioules/Data-Interpreter-IA) that:
Accepts .xlsx/.csv, builds an SQLite DB,
Generates SQL via an LLM agent,
Streams results back to the chat.
Feel free to reuse or cherry-pick any part of that code base.
The PoC has already been converted to the Open WebUI pipeline format.
Here is an explanation of my project:
Alternatives Considered
RAG on cell-wise embeddings (tested – poor recall on numeric columns).
Forcing users to install external BI tools, which harms the “self-hosted, offline-first” philosophy.
Additional Context
Contributes to the Information Retrieval focus area in the roadmap.
Would also open the door to integrating code-execution notebooks like Open Interpreter for advanced analyses.
A ready-made feature-request template was followed when writing this issue.
@tjbck commented on GitHub (Apr 28, 2025):
Should be supported via Functions.
@openlabollioules commented on GitHub (Apr 29, 2025):
What does that mean, exactly?