[GH-ISSUE #13281] feat: Integrate a Tabular-Data Analyzer (Excel/CSV → SQL pipeline) into Open WebUI RAG & Tools #16875
Originally created by @openlabollioules on GitHub (Apr 28, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/13281
Problem Description
Open WebUI already offers a Retrieval-Augmented Generation (RAG) pipeline for unstructured documents and web pages, and its roadmap explicitly states that this RAG layer will become “highly modular and extensively configurable” to address current limitations.
However, when the user uploads an Excel or CSV file, the existing pipeline tokenises the sheet as raw text and relies on semantic similarity between long cell strings. This approach often breaks on tabular data: numeric context is lost, column semantics disappear, and the model hallucinates or returns empty hits, as practitioners have reported in blogs and tutorials.
Desired Solution
Embed a Data-Analyzer tool that converts each incoming spreadsheet into an in-memory SQL (e.g., SQLite/DuckDB) table and exposes it to the LLM agent layer:
Parse Excel/CSV via pandas.read_excel/read_csv, infer types, create normalised tables.
Store a lightweight .db file in the existing /backend/data/documents volume so that persistence and multi-user RBAC keep working.
On every user question, the agent drafts a SQL query against the table, executes it, then feeds the result (plus a short JSON schema) back to the LLM prompt.
This mirrors the “Data Interpreter” pattern proposed in a recent arXiv paper (94.9% accuracy on DABench) and demonstrated in practical agent examples in the MetaGPT docs.
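The ingest-and-query flow described above can be sketched in a few lines. This is a minimal illustration under assumed names (`ingest_spreadsheet`, `run_agent_query`, the `data` table), not the actual Open WebUI implementation:

```python
import sqlite3
import pandas as pd


def ingest_spreadsheet(path: str, db_path: str = ":memory:") -> sqlite3.Connection:
    """Parse a CSV/Excel file and load it into a SQLite table.

    pandas infers the column dtypes; to_sql creates the normalised table.
    """
    df = pd.read_csv(path) if path.endswith(".csv") else pd.read_excel(path)
    conn = sqlite3.connect(db_path)
    df.to_sql("data", conn, if_exists="replace", index=False)
    return conn


def run_agent_query(conn: sqlite3.Connection, sql: str) -> list[tuple]:
    """Execute the SQL the LLM agent drafted and return the result rows.

    Only these (usually small) rows, plus a short schema description,
    are fed back into the LLM prompt.
    """
    return conn.execute(sql).fetchall()
```

In the full design, the `.db` file would be written under `/backend/data/documents` instead of `:memory:` so persistence and per-user RBAC keep working.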
As a Tool – Open WebUI’s plugin framework already allows Python-based tools that the LLM can invoke by name. A data_analyzer tool would be discoverable exactly like the existing weather or stock examples.
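As a rough sketch of what such a tool file could look like in Open WebUI's Python tool format (the `Tools` class convention is taken from the existing examples; the method name and parameters here are assumptions, not a finished implementation):

```python
import sqlite3


class Tools:
    def query_spreadsheet(self, db_path: str, sql: str) -> str:
        """
        Run a read-only SQL query against a spreadsheet that was
        converted to SQLite at upload time, returning rows as text
        so the LLM can reason over them.
        :param db_path: path of the .db file created during ingestion
        :param sql: the SELECT statement drafted by the agent
        """
        # Open read-only so a generated query cannot modify the data.
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
        try:
            rows = conn.execute(sql).fetchall()
            return "\n".join(str(row) for row in rows)
        finally:
            conn.close()
```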
As a RAG pre-processor – Detect the MIME type (csv/xlsx); if it matches, route the file through the Data Analyzer before embedding or hybrid BM25 search. That keeps the current RAG UX unchanged for users who simply drop a spreadsheet into the document inbox.
Accuracy boost on spreadsheets – Converting to relational form removes the need to embed entire rows as dense vectors and lets the model reason over explicit column names and numeric values, eliminating many hallucinations observed with vanilla RAG.
Performance – SQL filtering means the agent only embeds concise query results (often < 2 KB) into the chat context, preserving Ollama/OpenAI context tokens.
Modularity – Fits the “Tools → Functions → Pipelines” architecture so users can opt-in or disable per workspace.
Community adoption – Several Open WebUI users already run ad-hoc pandas notebooks to analyse their data; making this a first-class feature removes that friction.
I published an open-source PoC called Data-Interpreter-IA (https://github.com/openlabollioules/Data-Interpreter-IA) that:
Accepts .xlsx/.csv, builds an SQLite DB,
Generates SQL via an LLM agent,
Streams results back to the chat.
Feel free to reuse or cherry-pick any part of that code base.
The PoC has already been converted to the Open WebUI pipeline format.
Here is an explanation of my project:
Alternatives Considered
RAG on cell-wise embeddings (tested – poor recall on numeric columns).
Forcing users to install external BI tools, which harms the “self-hosted, offline-first” philosophy.
Additional Context
Contributes to the Information Retrieval focus area in the roadmap.
Would also open the door to integrating code-execution notebooks like Open Interpreter for advanced analyses.
A ready-made feature-request template was followed when writing this issue.
@tjbck commented on GitHub (Apr 28, 2025):
Should be supported via Functions.
@openlabollioules commented on GitHub (Apr 29, 2025):
What does that mean, exactly?