[PR #16209] [CLOSED] feat: Add RAG grounding step (extension to Google embeddings) #10868


GiteaMirror · 2025-11-11T19:15:59-06:00

GiteaMirror commented

2025-11-11 19:15:59 -06:00

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/16209
Author: @ipapapa
Created: 8/1/2025
Status: ❌ Closed

Base: main ← Head: feat/add-grounding-step-clean

📝 Commits (6)

ffff6fa feat: Add Google embeddings support
d7262aa Merge branch 'open-webui:main' into fix-google-embeddings
b21509d feat: Add Google embeddings support with migration guidance
2a793a7 feat: add RAG grounding step as extension to Google embeddings
1c5a0f5 feat: integrate grounding step into retrieval pipeline
050811a feat: add comprehensive tests for grounding step

📊 Changes

5 files changed (+676 additions, -1 deletions)


📝 README.md (+41 -0)
📝 backend/open_webui/config.py (+12 -0)
➕ backend/open_webui/retrieval/grounding.py (+153 -0)
📝 backend/open_webui/retrieval/utils.py (+123 -1)
➕ backend/open_webui/test/retrieval/test_grounding.py (+347 -0)

📄 Description

Summary

This PR adds a lightweight grounding step after retrieval to prevent semantic drift when using different embedding models. This addresses a well-documented problem in RAG systems where retrieved content appears relevant but leads to off-topic responses because of embedding model inconsistencies.

This PR extends and builds upon #16022 (Google embeddings support).

Recent academic research has identified significant challenges with embedding model mismatch in RAG systems:

Contextual Drift: "The gradual loss of relevance between retrieved data and the user's query" (arXiv:2409.14924v1)
Cross-Provider Inconsistency: "Variant embedding models exhibit different benefits across various areas, often leading to different similarity calculation results" (arXiv:2507.17442)
Model Drift Effects: "Gradual degradation in the model's performance" when switching between embedding providers

Solution: Post-Retrieval Validation

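Our implementation follows established grounding techniques: after retrieval, each document is validated against the query before it is passed to the LLM.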

Key Features

Re-embedding Validation: Retrieved documents are re-embedded using the same embedding function as the query for consistency
Similarity-Based Filtering: Documents below a configurable threshold are filtered out (similar to Anthropic's Contextual Retrieval approach)
Provider Agnostic: Works across all embedding providers (OpenAI, Google, Cohere, Ollama, etc.)
Graceful Degradation: Returns original documents if validation fails
Performance Monitoring: Provides filtering statistics and confidence metrics
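
The graceful-degradation and monitoring features could look roughly like the sketch below. It is a minimal illustration, not the code in `grounding.py`: the function name `filter_with_fallback`, the `score_fn` callable, and the keys of the statistics dictionary are all assumptions made for this example.

```python
def filter_with_fallback(documents, score_fn, threshold=0.3):
    """Filter documents by similarity score, falling back to the input on error.

    `score_fn` is a hypothetical callable that returns a similarity score
    for a single document; it stands in for the re-embedding step.
    """
    total = len(documents)
    try:
        scores = [score_fn(doc) for doc in documents]
    except Exception:
        # Graceful degradation: validation failed, so return the originals.
        stats = {"validated": False, "total": total, "kept": total, "filtered": 0}
        return list(documents), stats

    # Keep documents at or above the threshold; drop the rest.
    kept = [doc for doc, score in zip(documents, scores) if score >= threshold]
    stats = {
        "validated": True,
        "total": total,
        "kept": len(kept),
        "filtered": total - len(kept),
        # Illustrative confidence metric: mean score over all candidates.
        "mean_score": sum(scores) / len(scores) if scores else 0.0,
    }
    return kept, stats
```

Returning the statistics alongside the documents lets the caller log how many documents were dropped without changing the shape of the retrieval result.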

Configuration

# Enable post-retrieval validation
RAG_ENABLE_GROUNDING_STEP=true

# Set similarity threshold (0.0 to 1.0)
RAG_GROUNDING_THRESHOLD=0.3
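
As a rough sketch of how these two settings might be read and validated in Python: the helper name `load_grounding_config` and the fallback values are assumptions for illustration, and the PR's actual `config.py` changes may handle this differently.

```python
import os


def load_grounding_config(env=None):
    """Read the two grounding settings from an environment mapping.

    Hypothetical helper: the name and the fallback values are illustrative
    and are not taken from the PR's config.py.
    """
    env = os.environ if env is None else env
    # Off unless explicitly enabled, in line with "disabled by default".
    raw_enabled = env.get("RAG_ENABLE_GROUNDING_STEP", "false")
    enabled = raw_enabled.strip().lower() == "true"
    # Assumed fallback of 0.3, mirroring the example value above.
    threshold = float(env.get("RAG_GROUNDING_THRESHOLD", "0.3"))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("RAG_GROUNDING_THRESHOLD must be between 0.0 and 1.0")
    return enabled, threshold
```

Rejecting out-of-range thresholds at load time surfaces a misconfiguration early instead of silently filtering every document or none.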

Technical Approach

1. Standard Retrieval: Documents retrieved using existing embedding search
2. Re-embedding: Retrieved documents re-embedded with same model/function as query
3. Cosine Similarity: Calculate similarity between query and document embeddings
4. Threshold Filtering: Remove documents below similarity threshold
5. Return Validated Set: Only semantically aligned documents passed to LLM
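
The re-embedding, similarity, and filtering steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the PR's implementation: `ground_documents` is a hypothetical name, and `embed` is assumed to be any callable that maps a string to a list of floats, which may not match the real embedding function's signature.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ground_documents(query, documents, embed, threshold=0.3):
    """Return (document, score) pairs whose similarity to the query meets the threshold.

    `embed` is an assumed callable: str -> list[float]. The query and every
    document go through the same function, so their vectors are comparable.
    """
    query_vec = embed(query)
    kept = []
    for doc in documents:
        score = cosine_similarity(query_vec, embed(doc))
        if score >= threshold:
            kept.append((doc, score))
    return kept
```

Because the query and the documents are embedded by the same function, the scores are comparable even when the stored vectors came from a different provider.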

Research Validation

This approach is supported by recent research and industry work:

Anthropic's Contextual Retrieval: Similar techniques show "49% reduction in failed retrievals"
Confident RAG (2025): Multi-embedding validation shows "10% improvement over vanilla RAG"
GaRAGe Benchmark: Emphasizes importance of grounding validation in RAG pipelines

Addresses Community Feedback

This implementation responds to discussion in #16043, specifically addressing concerns about cross-embedding provider semantic alignment and the need for validation layers in multi-provider RAG systems.

Testing

✅ Comprehensive test suite covering all embedding providers
✅ Various threshold configurations and edge cases
✅ Graceful error handling and fallback scenarios
✅ User context integration testing

Backward Compatibility

Fully backward compatible - disabled by default
Zero performance impact when feature is turned off
Optional configuration - preserves existing behavior

Performance Impact

When enabled:

Minimal latency increase - only re-embeds retrieved documents (typically 3-10 docs)
Improved accuracy - filters out semantically misaligned content
Reduced hallucination - LLM receives more contextually relevant information

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


GiteaMirror added the pull-request label 2025-11-11 19:15:59 -06:00

GiteaMirror closed this issue

2025-11-11 19:15:59 -06:00

GiteaMirror referenced this issue

2026-04-20 04:16:12 -05:00

[PR #10868] [MERGED] chore: use logging.getLevelNamesMapping() for validating log level #22610

GiteaMirror referenced this issue

2026-04-25 11:20:38 -05:00

[PR #10868] [MERGED] chore: use logging.getLevelNamesMapping() for validating log level #38240

GiteaMirror referenced this issue

2026-04-29 20:06:53 -05:00

[PR #10868] [MERGED] chore: use logging.getLevelNamesMapping() for validating log level #45658