[PR #11234] [CLOSED] feat: Adding new Function for custom Knowledge parsing #22693

opened 2026-04-20 04:19:24 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/11234
Author: @DanielDowns
Created: 3/5/2025
Status: Closed

Base: dev ← Head: main


📝 Commits (9)

  • 3a404c9 added parsing type and basic tracing
  • e0f5487 default parsing now properly parses and uploads
  • f8e201f handle all custom parsers and switch to default if none are provided
  • 7dd06c6 removed original save function thats now in plugin. moved get_plugin() to correct module. DefaultParser is now named such with proper structure and logging
  • 1a51584 Merge pull request #10939 from open-webui/dev
  • 4770285 Merge pull request #11211 from open-webui/dev
  • d512a68 Merge branch 'main' of https://github.com/DanielDowns/open-webui into parser_addition
  • 39c5512 Merge pull request #1 from DanielDowns/parser_addition
  • d06b6d7 removed testing print statement

📊 Changes

4 files changed (+325 additions, -200 deletions)

View changed files

📝 backend/open_webui/functions.py (+35 -1)
📝 backend/open_webui/routers/retrieval.py (+68 -199)
➕ backend/open_webui/utils/parser.py (+220 -0)
📝 backend/open_webui/utils/plugin.py (+2 -0)

📄 Description

Pull Request Checklist

Note to first-time contributors: Please open a discussion post in Discussions and describe your changes before submitting a pull request.

Before submitting, make sure you've checked the following:

  • [x] Target branch: Please verify that the pull request targets the dev branch.
  • [x] Description: Provide a concise description of the changes made in this pull request.
  • [x] Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
  • [ ] Documentation: Have you updated relevant documentation (Open WebUI Docs), or other documentation sources?
  • [x] Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
  • [ ] Testing: Have you written and run sufficient tests for validating the changes?
  • [x] Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • [x] Prefix: To clearly categorize this pull request, prefix the pull request title, using one of the following:
    • BREAKING CHANGE: Significant changes that may affect compatibility
    • build: Changes that affect the build system or external dependencies
    • ci: Changes to our continuous integration processes or workflows
    • chore: Refactor, cleanup, or other non-functional code changes
    • docs: Documentation update or addition
    • feat: Introduces a new feature or enhancement to the codebase
    • fix: Bug fix or error correction
    • i18n: Internationalization or localization changes
    • perf: Performance improvement
    • refactor: Code restructuring for better maintainability, readability, or scalability
    • style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc.)
    • test: Adding missing tests or correcting existing tests
    • WIP: Work in progress, a temporary label for incomplete or ongoing work

Changelog Entry

Description

  • Current plugins offer a lot of flexibility, but none hook into how Knowledge is uploaded. This prevents new types of parsing or RAG from being implemented. A new plugin type, in a similar style to the other plugins, allows users to add any functionality they need.

Discussion is here: https://github.com/open-webui/open-webui/discussions/11169

Added

  • New Function type: Parser. Used when new Knowledge is added (wherever save_docs_to_vector_db was originally called).
  • ParserType enum: specifies which types of Knowledge a Parser should be used for.
  • DefaultParser: functionally equivalent to the original parsing setup; used as a fallback if users do not provide their own. It is inheritable, so users can focus on developing only the parts of the plugin relevant to their use case.

Changed

  • Removed the save_docs_to_vector_db call; its implementation now lives in DefaultParser.
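Because DefaultParser is inheritable, a custom parser only needs to override the hooks it cares about. A minimal standalone sketch of that pattern (the DefaultParser stub and the AuditParser name here are illustrative stand-ins, not the PR's actual code, which lives in backend/open_webui/utils/parser.py):

```python
class DefaultParser:
    """Stand-in for the PR's DefaultParser: a pipeline with overridable hooks."""

    def pre(self, request, **kwargs):
        pass  # called before the rest of the parser functions

    def post(self, request, **kwargs):
        pass  # called after the rest of the parser functions

    def save_docs_to_vector_db(self, request, docs, collection_name, **kwargs):
        self.pre(request, docs=docs, collection_name=collection_name)
        # split / metadata / embed / store would run here in the real class
        self.post(request)
        return True


class AuditParser(DefaultParser):
    """Overrides only the pre/post hooks; the pipeline itself is inherited."""

    def __init__(self):
        self.events = []

    def pre(self, request, **kwargs):
        self.events.append(("pre", kwargs.get("collection_name")))

    def post(self, request, **kwargs):
        self.events.append(("post",))
```

Subclassing this way keeps the default split/embed/store behavior intact while letting a plugin author inject logging, filtering, or custom metadata at the hook points.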

Example Plugin

import logging
import uuid
import json
from datetime import datetime
from typing import Optional

from fastapi import Request

import tiktoken

from langchain.text_splitter import RecursiveCharacterTextSplitter, TokenTextSplitter
from langchain_core.documents import Document

from open_webui.env import SRC_LOG_LEVELS
from open_webui.constants import ERROR_MESSAGES
from open_webui.retrieval.utils import get_embedding_function
from open_webui.retrieval.vector.connector import VECTOR_DB_CLIENT
from open_webui.utils.parser import PARSING_TYPE

log = logging.getLogger(__name__)
log.setLevel(SRC_LOG_LEVELS["RAG"])


class Parser:
    # Update valves/ environment variables based on your selected database
    def __init__(self):
        self.name = "Custom Parser"
        self.parser_type = PARSING_TYPE.ALL

    def save_docs_to_vector_db(self,
                               request: Request,
                               docs,
                               collection_name,
                               metadata: Optional[dict] = None,
                               overwrite: bool = False,
                               split: bool = True,
                               add: bool = False,
                               user=None,
                               ) -> bool:
        self.pre(request, docs=docs, collection_name=collection_name)

        # placeholder for custom behavior; prefer logging over bare print
        log.info("JUST LIKE DEFAULT BUT THIS ONE IS TOTALLY CUSTOM")

        texts, docs = self.split(request, docs)
        metadatas = self.metadata(request, collection_name, docs, metadata)
        embeddings = self.embed(request, texts, user)

        # use `assert condition, message` (a comma, not `and`) so the message is shown on failure
        assert len(metadatas) == len(texts), f"length mismatch: metadata {metadatas} vs texts {texts}"
        assert len(metadatas) == len(embeddings), f"length mismatch: metadata {metadatas} vs embeddings {embeddings}"

        self.store(request, collection_name, texts, embeddings, metadatas, overwrite, add)

        self.post(request)

        return True

    def pre(self, request, **kwargs):
        '''
        called before the rest of the parser functions
        '''

        docs = kwargs.pop('docs', None)
        collection_name = kwargs.pop('collection_name', None)

        def _get_docs_info(docs: list[Document]) -> str:
            docs_info = set()

            # Trying to select relevant metadata identifying the document.
            for doc in docs:
                metadata = getattr(doc, "metadata", {})
                doc_name = metadata.get("name", "")
                if not doc_name:
                    doc_name = metadata.get("title", "")
                if not doc_name:
                    doc_name = metadata.get("source", "")
                if doc_name:
                    docs_info.add(doc_name)

            return ", ".join(docs_info)

        log.info(
            f"save_docs_to_vector_db: document {_get_docs_info(docs)} {collection_name}"
        )

    def post(self, request, **kwargs):
        '''
        called after the rest of the parser functions
        '''
        pass

    def metadata(self, request, collection_name, docs, metadata):
        # Check if entries with the same hash (metadata.hash) already exist
        if metadata and "hash" in metadata:
            result = VECTOR_DB_CLIENT.query(
                collection_name=collection_name,
                filter={"hash": metadata["hash"]},
            )

            if result is not None:
                existing_doc_ids = result.ids[0]
                if existing_doc_ids:
                    log.info(f"Document with hash {metadata['hash']} already exists")
                    raise ValueError(ERROR_MESSAGES.DUPLICATE_CONTENT)

        metadatas = [
            {
                **doc.metadata,
                **(metadata if metadata else {}),
                "embedding_config": json.dumps(
                    {
                        "engine": request.app.state.config.RAG_EMBEDDING_ENGINE,
                        "model": request.app.state.config.RAG_EMBEDDING_MODEL,
                    }
                ),
            }
            for doc in docs
        ]

        # ChromaDB does not like datetime formats
        # for meta-data so convert them to string.
        for md in metadatas:  # avoid shadowing the `metadata` parameter
            for key, value in md.items():
                if isinstance(value, (datetime, list, dict)):
                    md[key] = str(value)

        return metadatas

    def split(self, request, docs):
        if request.app.state.config.TEXT_SPLITTER in ["", "character"]:
            text_splitter = RecursiveCharacterTextSplitter(
                chunk_size=request.app.state.config.CHUNK_SIZE,
                chunk_overlap=request.app.state.config.CHUNK_OVERLAP,
                add_start_index=True,
            )
        elif request.app.state.config.TEXT_SPLITTER == "token":
            log.info(
                f"Using token text splitter: {request.app.state.config.TIKTOKEN_ENCODING_NAME}"
            )

            # validate the configured encoding name early; raises if it is unknown
            tiktoken.get_encoding(str(request.app.state.config.TIKTOKEN_ENCODING_NAME))
            text_splitter = TokenTextSplitter(
                encoding_name=str(request.app.state.config.TIKTOKEN_ENCODING_NAME),
                chunk_size=request.app.state.config.CHUNK_SIZE,
                chunk_overlap=request.app.state.config.CHUNK_OVERLAP,
                add_start_index=True,
            )
        else:
            raise ValueError(ERROR_MESSAGES.DEFAULT("Invalid text splitter"))

        docs = text_splitter.split_documents(docs)

        if len(docs) == 0:
            raise ValueError(ERROR_MESSAGES.EMPTY_CONTENT)

        texts = [doc.page_content for doc in docs]
        return texts, docs

    def embed(self, request, texts, user=None):
        embedding_function = get_embedding_function(
            request.app.state.config.RAG_EMBEDDING_ENGINE,
            request.app.state.config.RAG_EMBEDDING_MODEL,
            request.app.state.ef,
            (
                request.app.state.config.RAG_OPENAI_API_BASE_URL
                if request.app.state.config.RAG_EMBEDDING_ENGINE == "openai"
                else request.app.state.config.RAG_OLLAMA_BASE_URL
            ),
            (
                request.app.state.config.RAG_OPENAI_API_KEY
                if request.app.state.config.RAG_EMBEDDING_ENGINE == "openai"
                else request.app.state.config.RAG_OLLAMA_API_KEY
            ),
            request.app.state.config.RAG_EMBEDDING_BATCH_SIZE,
        )

        embeddings = embedding_function(
            list(map(lambda x: x.replace("\n", " "), texts)), user=user
        )

        return embeddings

    def store(self, request, collection_name, texts, embeddings, metadatas, overwrite=False, add=True):
        # don't do this until the last step to limit deleting collections if errors are thrown
        if VECTOR_DB_CLIENT.has_collection(collection_name=collection_name):
            log.info(f"collection {collection_name} already exists")

            if overwrite:
                log.info(f"deleting existing collection {collection_name}")
                VECTOR_DB_CLIENT.delete_collection(collection_name=collection_name)
            elif not add:
                log.info(
                    f"collection {collection_name} already exists, overwrite is False and add is False"
                )
                return True

        items = [
            {
                "id": str(uuid.uuid4()),
                "text": text,
                "vector": embeddings[idx],
                "metadata": metadatas[idx],
            }
            for idx, text in enumerate(texts)
        ]

        VECTOR_DB_CLIENT.insert(
            collection_name=collection_name,
            items=items,
        )


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

Reference: github-starred/open-webui#22693