[PR #12050] [MERGED] Fix: Normalize all database distances to score in [0, 1] (needs testing for different DBs) #45890

Closed
opened 2026-04-29 20:28:42 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/12050
Author: @mahenning
Created: 3/25/2025
Status: Merged
Merged: 3/27/2025
Merged by: @tjbck

Base: dev ← Head: fix-db-order


📝 Commits (3)

  • 94d9d3d Fix: Normalze all database distances to score in [0, 1]
  • 7531b7d Satisfy github format check
  • 7490bc9 Merge branch 'dev' into fix-db-order

📊 Changes

6 files changed (+22 additions, -29 deletions)


📝 backend/open_webui/retrieval/utils.py (+5 -24)
📝 backend/open_webui/retrieval/vector/dbs/chroma.py (+7 -1)
📝 backend/open_webui/retrieval/vector/dbs/milvus.py (+4 -1)
📝 backend/open_webui/retrieval/vector/dbs/opensearch.py (+1 -1)
📝 backend/open_webui/retrieval/vector/dbs/pgvector.py (+3 -1)
📝 backend/open_webui/retrieval/vector/dbs/qdrant.py (+2 -1)

📄 Description

Pull Request Checklist

Note to first-time contributors: Please open a discussion post in Discussions and describe your changes before submitting a pull request.

Before submitting, make sure you've checked the following:

  • Target branch: Please verify that the pull request targets the dev branch.
  • Description: Provide a concise description of the changes made in this pull request.
  • Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
  • Documentation: Have you updated relevant documentation Open WebUI Docs, or other documentation sources?
  • Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
  • Testing: Have you written and run sufficient tests for validating the changes?
  • Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • Prefix: To clearly categorize this pull request, prefix the pull request title, using one of the following:
    • BREAKING CHANGE: Significant changes that may affect compatibility
    • build: Changes that affect the build system or external dependencies
    • ci: Changes to our continuous integration processes or workflows
    • chore: Refactor, cleanup, or other non-functional code changes
    • docs: Documentation update or addition
    • feat: Introduces a new feature or enhancement to the codebase
    • fix: Bug fix or error correction
    • i18n: Internationalization or localization changes
    • perf: Performance improvement
    • refactor: Code restructuring for better maintainability, readability, or scalability
    • style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc.)
    • test: Adding missing tests or correcting existing tests
    • WIP: Work in progress, a temporary label for incomplete or ongoing work

Changelog Entry

Description

  • This PR intends to normalize all "distance" scores from the different databases to return values between 0 and 1, with 1 as the best value
  • It corrects the "Relevance" score for some of these databases

Fixed

  • Fixes the distance score for all supported databases so that it falls in the [0, 1] range

Additional Information

  • The "distance" score returned for the "cosine" metric differs between databases: some return a cosine similarity in [-1, 1] (1 is best), others return a cosine distance in [0, 2] (0 is best)
  • In this PR, the "distance" values returned by the different databases are all normalized to a score in [0, 1]
  • This is important for non-hybrid RAG, as the "Relevance" score expects values in [0, 1]
  • Also, the result ordering was only reversed for Chroma, although pgvector needs it too
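
The two conversions described above can be sketched as follows. This is a minimal illustration under the ranges stated in this description, not the code from the PR; the function names `similarity_to_score`, `distance_to_score` and `rank_by_score` are hypothetical.

```python
from typing import List, Tuple


def similarity_to_score(cos_sim: float) -> float:
    """Map a cosine similarity in [-1, 1] (1 is best) to a score in [0, 1]."""
    return (cos_sim + 1.0) / 2.0


def distance_to_score(cos_dist: float) -> float:
    """Map a cosine distance 1 - cos_sim in [0, 2] (0 is best) to a score in [0, 1]."""
    return (2.0 - cos_dist) / 2.0


def rank_by_score(hits: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Order (document, score) pairs best first, i.e. by descending score."""
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Both mappings agree with each other: a distance of `1 - s` gives `(2 - (1 - s)) / 2 = (s + 1) / 2`. Once every backend returns a score where 1 is best, a single descending sort should be enough and no per-database reversal is needed.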

Sources:

Chromadb: https://docs.trychroma.com/docs/collections/configure

  • Formula in the box for cosine: the distance is "1 - cosine similarity", which ranges over [0, 2], with 0 as the best "distance"

Elasticsearch: https://www.elastic.co/search-labs/blog/vector-similarity-techniques-and-scoring

  • Formula (5). Elasticsearch already normalizes in [0, 1], no work needed.

Milvus: https://milvus.io/docs/metric.md

  • "COSINE" in the table below the "Note" block. Uses the raw cosine similarity in [-1, 1]

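Opensearch: The implementation in open-webui used [0, 2] (cosine similarity shifted by +1), see the code at the link below
https://github.com/open-webui/open-webui/blob/4906af93191dec143a19dc250a11da2a94d1d6e8/backend/open_webui/retrieval/vector/dbs/opensearch.py#L123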

Pgvector: https://github.com/pgvector/pgvector?tab=readme-ov-file#querying
https://github.com/supabase/supabase/issues/12244

  • The first link lists cosine "distance" as the metric; the second link confirms it is "1 - cosine similarity", which is again [0, 2] with 0 as best, same as Chromadb

qdrant: https://qdrant.tech/documentation/concepts/collections/#collections

  • Links "cosine" to the Wikipedia article on cosine similarity, which is in [-1, 1]
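
Summarizing the sources above as a lookup table may make the per-database differences easier to see. This is only a sketch based on the ranges listed here: the `NORMALIZERS` table and the `normalize` helper are hypothetical names, and judging by the changed files list the PR applies each conversion inside the respective backend file rather than in one central place.

```python
from typing import Callable, Dict

# Raw "cosine" value returned by each backend -> score in [0, 1], 1 is best.
# Ranges are taken from the sources listed in this description.
NORMALIZERS: Dict[str, Callable[[float], float]] = {
    "chroma": lambda d: (2.0 - d) / 2.0,    # distance in [0, 2], 0 is best
    "elasticsearch": lambda s: s,           # already in [0, 1]
    "milvus": lambda s: (s + 1.0) / 2.0,    # similarity in [-1, 1]
    "opensearch": lambda s: s / 2.0,        # similarity shifted by +1, so [0, 2]
    "pgvector": lambda d: (2.0 - d) / 2.0,  # distance in [0, 2], 0 is best
    "qdrant": lambda s: (s + 1.0) / 2.0,    # similarity in [-1, 1]
}


def normalize(db: str, raw: float) -> float:
    """Convert a backend-specific cosine value to a relevance score in [0, 1]."""
    return NORMALIZERS[db](raw)
```

For example, a perfect match is reported as `0.0` by Chroma and pgvector, `1.0` by Milvus and qdrant, and `2.0` by the shifted OpenSearch script, and all of them map to a score of `1.0` here.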

  • Tested for: ChromaDB, PGVector (by almajo)
  • Partly tested in a script: qdrant (I only tested the cosine output for two handwritten opposing vectors to confirm [-1, 1]; not tested in OWUI, but I'm pretty sure about the [-1, 1])
  • Not yet tested: Elasticsearch, Milvus, OpenSearch (testing welcome from people who use these databases!)
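
The opposing-vectors check mentioned above can be reproduced without any database. The sketch below is plain Python and only confirms the mathematical range of cosine similarity; it does not query qdrant, and the helper name `cosine_similarity` is an assumption for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two handwritten vectors pointing in exactly opposite directions.
v = [1.0, 2.0, 3.0]
opposite = [-1.0, -2.0, -3.0]

worst = cosine_similarity(v, opposite)  # lower end of the range
best = cosine_similarity(v, v)          # upper end of the range
```

Opposing vectors give the lower bound of the range and identical vectors the upper bound, so a backend that reports the raw cosine similarity needs the `(s + 1) / 2` shift to land in [0, 1].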

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-29 20:28:42 -05:00
Reference: github-starred/open-webui#45890