File deletion doesn't properly clean up database entries, causing issues with re-uploads #2751

@Constey commented on GitHub (Dec 8, 2024):

I assumed I could simply call the API with the same file again to re-upload it.
The upload itself seems to work, but adding the file to the knowledge base fails with a 400 Bad Request.
Running on v0.4.8 (latest): https://github.com/open-webui/open-webui/releases/tag/v0.4.8

Steps to Reproduce:

- Upload a file
- Add the file to the knowledge base
- Upload the file again
- Try adding it to the knowledge base again

Uploaded successfully with file_id: 967dd000-429c-46bb-9931-f352364dd746
Adding file 967dd000-429c-46bb-9931-f352364dd746 to knowledge def088e5-a452-4dd9-b67d-4de942f3785b...
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://xxx/api/v1/knowledge/def088e5-a452-4dd9-b67d-4de942f3785b/file/add

My test script for upload:

import requests

def add_file_to_knowledge(token, knowledge_id, file_id, base_url):
    # Attach an already-uploaded file (file_id) to a knowledge base.
    url = f'{base_url}/api/v1/knowledge/{knowledge_id}/file/add'
    headers = {
        'Authorization': f'Bearer {token}',
        'Content-Type': 'application/json'
    }
    data = {'file_id': file_id}
    response = requests.post(url, headers=headers, json=data)
    response.raise_for_status()  # Raises HTTPError for 4xx/5xx responses
    return response.json()
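
Since `raise_for_status()` only reports the status line, the reason for the 400 stays hidden. A minimal sketch for surfacing it, assuming the server answers with a JSON body carrying a `detail` field (the error output further down in this thread has that shape). `extract_error_detail` is a hypothetical helper, not part of Open WebUI or `requests`:

```python
import json


def extract_error_detail(body_text):
    """Return the 'detail' field of a JSON error body, or the raw text as a fallback."""
    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError:
        # Not JSON (e.g. an HTML error page): hand back the text unchanged.
        return body_text.strip()
    if isinstance(payload, dict) and "detail" in payload:
        return payload["detail"]
    return payload


# Intended use around the existing call (requests attaches the response to the error):
#     try:
#         add_file_to_knowledge(token, knowledge_id, file_id, base_url)
#     except requests.exceptions.HTTPError as err:
#         print(extract_error_detail(err.response.text))
print(extract_error_detail('{"detail": "example message"}'))
```

The printed detail should say why the second add is rejected, which is more useful for this report than the bare 400.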

GiteaMirror commented

@Constey commented on GitHub (Dec 8, 2024):

I think the issue is located somewhere around here: https://github.com/open-webui/open-webui/blob/29a271959556743e6deb4d55a5a982983335d7ab/backend/open_webui/apps/webui/routers/knowledge.py#L244

@AlgorithmicKing737 commented on GitHub (Dec 31, 2024):

any solution yet?

@Classic298 commented on GitHub (Jan 13, 2025):

I want to add to this issue, that files uploaded in normal chats are not deleted from the vector database either. Even if you delete the chat, the vector database does not shrink. It stays the same size. And in fact, it only grows.

Even if you press "reset vector-storage" in the admin panel under documents, nothing gets deleted from the database.

So 1) nothing gets deleted even if the chat where the file was uploaded is deleted and 2) the reset vectorstorage button also doesn't do anything.

I am on version 0.5.4 but this was always the case for me on previous versions as well. I am on pip installation if that matters and this issue has been discussed here as well: https://github.com/open-webui/open-webui/discussions/5558

@juananpe commented on GitHub (Jan 13, 2025):

@Classic298 Oh, I see. My PR https://github.com/open-webui/open-webui/pull/8499 fixes the situation when you remove a file that has been added via a Knowledge Base, but it doesn't fix the problem when the file is added directly from the Upload Documents option in the chat. I'll have a look at it tomorrow.

@Classic298 commented on GitHub (Jan 22, 2025):

Was the issue with files not being deleted from the db even after deleting the chat fixed?

@tjbck commented on GitHub (Jan 22, 2025):

Everything uploaded to Open WebUI is being kept for audit/logging purposes which is a security requirement for many organisations. You should utilise external scripts to clean the upload directory for now!
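
A rough sketch of what such an external cleanup script could look like, assuming an age-based retention policy. The directory path and the 30-day threshold are placeholders, and `find_stale_files` / `clean_upload_dir` are hypothetical helpers. It only removes files on disk and does not touch database rows or vector store entries:

```python
import time
from pathlib import Path


def find_stale_files(upload_dir, max_age_days, now=None):
    """List files under upload_dir whose modification time is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        path
        for path in Path(upload_dir).rglob("*")
        if path.is_file() and path.stat().st_mtime < cutoff
    )


def clean_upload_dir(upload_dir, max_age_days, dry_run=True):
    """Delete stale files; with dry_run=True only report what would be removed."""
    stale = find_stale_files(upload_dir, max_age_days)
    for path in stale:
        if dry_run:
            print(f"would delete {path}")
        else:
            path.unlink()
    return stale
```

A call such as `clean_upload_dir("/path/to/uploads", max_age_days=30)` lists candidates first; passing `dry_run=False` actually deletes them. Defaulting to a dry run is deliberate, since the upload directory location differs between installations.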

@Classic298 commented on GitHub (Jan 23, 2025):

Then why was the deletion of files, when deleting them from the knowledge base, even implemented and accepted by you? If files should not get deleted.

And even the implementation of file deletion from chats was accepted and merged by you - it is 90% implemented. Only the actual deletion logic for the chroma db is missing.

Maybe with an environment variable or admin setting (either is fine), it would be cool to be able to set this.
An ever growing chroma database and uploads folder will grow to be a problem relatively quickly, no?

@Classic298 commented on GitHub (Jan 25, 2025):

This issue was not fixed yet as there is literally a placeholder for the missing code, just saying. Writing in the commit notes that 7181 is fixed is weird

2025-11-11 15:13:38 -06:00

@Classic298 commented on GitHub (Jan 29, 2025):

Bump: the issue is not fixed, and the current implementation goes against the ethos that Tim described.

@tjbck commented on GitHub (Jan 29, 2025):

Reverted #8499 in 4abede9a2bad7902e23e8bff2de93fff2c163ce4

@Jeevanhm commented on GitHub (Feb 6, 2025):

> From my thought i can just call the api with the same file again to do a re-upload. it seems the upload of a file works, but adding the file to the knowledge brings the 400 bad request issue. running on: (v0.4.8 (latest)
>
> Steps to Reproduce:
>
> * Upload a file
> * add the file to the knowledgebase
> * Upload the file again
> * Try adding it to the knowledgebase again
>
> Uploaded successfully with file_id: 967dd000-429c-46bb-9931-f352364dd746
> Adding file 967dd000-429c-46bb-9931-f352364dd746 to knowledge def088e5-a452-4dd9-b67d-4de942f3785b...
> requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://xxx/api/v1/knowledge/def088e5-a452-4dd9-b67d-4de942f3785b/file/add
>
> My test script for upload:
>
>     def add_file_to_knowledge(token, knowledge_id, file_id, base_url):
>         url = f'{base_url}/api/v1/knowledge/{knowledge_id}/file/add'
>         headers = {
>             'Authorization': f'Bearer {token}',
>             'Content-Type': 'application/json'
>         }
>         data = {'file_id': file_id}
>         response = requests.post(url, headers=headers, json=data)
>         response.raise_for_status()
>         return response.json()

can you share the final script used to upload and add files to the knowledge base please.

@Jeevanhm commented on GitHub (Feb 7, 2025):

I'm getting this error while adding files to knowledge collections. Any idea? Uploading and deleting files works.

C:\Windows\system32>curl -X POST http://192.xx.xx.xx:/api/v1/knowledge/fb24ac30-d611-4988-90dc-b29fe10d118a/file/add -H "Authorization: Bearer sk-f821a6733a024915932dc30ed44b2d4a" -H "Content-Type: application/json" -d '{"file_id": "430e0a59-5fea-4d4e-87c1-8f2ad38c3dda"}'

{"detail":[{"type":"json_invalid","loc":["body",0],"msg":"JSON decode error","input":{},"ctx":{"error":"Expecting value"}}]}curl: (3) unmatched close brace/bracket in URL position 37:
430e0a59-5fea-4d4e-87c1-8f2ad38c3dda}'
^
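
The `curl: (3) unmatched close brace/bracket in URL` line suggests that the Windows command prompt did not treat the single quotes as quoting, so the JSON body was split and part of it was parsed as a URL. cmd.exe only recognises double quotes for that purpose. A sketch that sidesteps shell quoting by building the request in Python with the standard library. The host, IDs, and token below are placeholders, and `build_add_file_request` is a hypothetical helper:

```python
import json
import urllib.request


def build_add_file_request(base_url, knowledge_id, file_id, token):
    """Build (but do not send) the POST request for /knowledge/{id}/file/add."""
    url = f"{base_url}/api/v1/knowledge/{knowledge_id}/file/add"
    # json.dumps produces the body, so no shell ever has to parse the quotes.
    body = json.dumps({"file_id": file_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


req = build_add_file_request("http://localhost:3000", "KNOWLEDGE_ID", "FILE_ID", "YOUR_TOKEN")
print(req.full_url)
print(req.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(req)
```

Staying with curl on Windows, the usual alternative is to wrap the body in double quotes and escape the inner ones with backslashes.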

@Constey commented on GitHub (Feb 7, 2025):

> > From my thought i can just call the api with the same file again to do a re-upload. it seems the upload of a file works, but adding the file to the knowledge brings the 400 bad request issue. running on: (v0.4.8 (latest)
> >
> > Steps to Reproduce:
> >
> > * Upload a file
> > * add the file to the knowledgebase
> > * Upload the file again
> > * Try adding it to the knowledgebase again
> >
> > Uploaded successfully with file_id: 967dd000-429c-46bb-9931-f352364dd746
> > Adding file 967dd000-429c-46bb-9931-f352364dd746 to knowledge def088e5-a452-4dd9-b67d-4de942f3785b...
> > requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://xxx/api/v1/knowledge/def088e5-a452-4dd9-b67d-4de942f3785b/file/add
> >
> > My test script for upload:
> >
> >     def add_file_to_knowledge(token, knowledge_id, file_id, base_url):
> >         url = f'{base_url}/api/v1/knowledge/{knowledge_id}/file/add'
> >         headers = {
> >             'Authorization': f'Bearer {token}',
> >             'Content-Type': 'application/json'
> >         }
> >         data = {'file_id': file_id}
> >         response = requests.post(url, headers=headers, json=data)
> >         response.raise_for_status()
> >         return response.json()
>
> can you share the final script used to upload and add files to the knowledge base please.

Like this. We put the files in separate knowledge bases, so it's a bit more complex, but it should show how it works:

"""
This script uploads .txt files from a directory structure into two different knowledgebases
based on a "special_folder" configuration. If a folder matches the special_folder, the file
is uploaded to knowledgebase B, otherwise to knowledgebase A.

New Feature (version 1.4.0):
    A command-line argument `--only-special` has been added. When used, only files within
    the special_folder subdirectories will be uploaded. All other files/folders will be skipped.

Exception handling has been added to ensure that if an error occurs (like a HTTPError from
the server), the script logs the error to upload_webui.log and continues processing the
remaining files.

A per-space summary is included to show how many files were processed, how many were
successfully uploaded, and how many failed to upload for each top-level folder (space).

Prerequisites:
    pip install requests

Usage:
    python upload_script.py [--only-special]

    --only-special   Only upload files that belong to the special folder (e.g., "5027").

Configuration:
    - Update config.py to include:
      {
        "output_text_base_dir": "text_output",
        "openwebui_url": "http://localhost:3000",
        "openwebui_token": "YOUR_OPENWEBUI_TOKEN_HERE",
        "openwebui_knowledge_id_a": "KNOWLEDGE_COLLECTION_ID_A",
        "openwebui_knowledge_id_b": "KNOWLEDGE_COLLECTION_ID_B",
        "special_folder": "5027"
      }
"""

import os
import sys
import argparse
import requests
import logging
from config import config

# Configure logging to a file named "upload_webui.log".
# INFO level logs general operation flow, WARNING for unusual states, ERROR for exceptions.
logging.basicConfig(
    filename='upload_webui.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

def upload_file(token, file_path, base_url):
    """
    Uploads a single file to the web UI. Returns the JSON response from the server.
    Raises requests.exceptions.HTTPError if the server responds with an error status code.

    :param token: Auth token.
    :param file_path: Path to the file being uploaded.
    :param base_url: Base URL for the OpenWebUI server.
    :return: JSON response from the server with the 'id' of the uploaded file.
    """
    url = f'{base_url}/api/v1/files/'
    headers = {
        'Authorization': f'Bearer {token}',
        'Accept': 'application/json'
    }

    with open(file_path, 'rb') as f:
        files = {'file': f}
        response = requests.post(url, headers=headers, files=files)
    response.raise_for_status()  # Will raise HTTPError for 4xx/5xx responses
    return response.json()

def add_file_to_knowledge(token, knowledge_id, file_id, base_url):
    """
    Adds an uploaded file to a specified knowledgebase.

    :param token: Auth token.
    :param knowledge_id: ID of the target knowledgebase.
    :param file_id: ID of the file to be added.
    :param base_url: Base URL for the OpenWebUI server.
    :return: JSON response from the server.
    """
    url = f'{base_url}/api/v1/knowledge/{knowledge_id}/file/add'
    headers = {
        'Authorization': f'Bearer {token}',
        'Content-Type': 'application/json'
    }
    data = {'file_id': file_id}
    response = requests.post(url, headers=headers, json=data)
    response.raise_for_status()  # Will raise HTTPError for 4xx/5xx responses
    return response.json()

def main():
    """
    Main logic to:
    1. Parse command-line arguments.
    2. Traverse text_base_dir for .txt files.
    3. If --only-special is set, only process files in the special_folder.
    4. Decide which knowledgebase to upload the file to based on 'special_folder'.
    5. Attempt file upload and knowledgebase addition, handle errors by logging and continuing.
    6. Keep track of per-space metrics (processed, success, fail).
    7. Print and log summary metrics for each space at the end of the script.
    """
    parser = argparse.ArgumentParser(description="Upload text files to knowledgebase(s).")
    parser.add_argument(
        "--only-special",
        action="store_true",
        help="Only upload files in the special_folder (skip others)."
    )
    args = parser.parse_args()
    only_special = args.only_special

    text_base_dir = config["output_text_base_dir"]
    token = config["openwebui_token"]
    base_url = config["openwebui_url"]
    knowledge_id_a = config["openwebui_knowledge_id_a"]
    knowledge_id_b = config["openwebui_knowledge_id_b"]
    special_folder = config["special_folder"]

    if not os.path.exists(text_base_dir):
        msg = f"Text directory '{text_base_dir}' does not exist. Nothing to upload."
        print(msg)
        logging.warning(msg)
        return

    # Counters for knowledgebase A and B
    uploaded_count_a = 0
    uploaded_count_b = 0

    # Dictionary to track counts per space: { 'SPACEKEY': {'processed': 0, 'success': 0, 'fail': 0}, ... }
    space_summary = {}

    for root, dirs, files in os.walk(text_base_dir):
        # Determine the relative path from text_base_dir
        relative_path = os.path.relpath(root, text_base_dir)
        current_folders = relative_path.split(os.sep)

        # Identify the top-level space name (if we're not at the root of text_base_dir)
        if relative_path == ".":
            space_name = "ROOT"
        else:
            space_name = current_folders[0]

        # Ensure this space_name is in our space_summary dictionary
        if space_name not in space_summary:
            space_summary[space_name] = {"processed": 0, "success": 0, "fail": 0}

        # If only_special is set, skip this entire path unless it has special_folder
        if only_special and special_folder not in current_folders:
            continue

        # Determine the target knowledgebase (A or B)
        if special_folder in current_folders:
            target_knowledge_id = knowledge_id_b
            target_count_var = "B"
        else:
            target_knowledge_id = knowledge_id_a
            target_count_var = "A"

        for file in files:
            if file.endswith(".txt"):
                # We have a text file; increment the processed counter for this space
                space_summary[space_name]["processed"] += 1

                file_path = os.path.join(root, file)
                message = f"Uploading file: {file_path} to knowledgebase {target_knowledge_id}"
                print(message)
                logging.info(message)

                # Attempt the upload
                try:
                    upload_response = upload_file(token, file_path, base_url)
                    file_id = upload_response.get("id")
                except requests.exceptions.HTTPError as http_err:
                    error_msg = (
                        f"HTTP error occurred while uploading {file_path}: {http_err}"
                    )
                    print(error_msg)
                    logging.error(error_msg, exc_info=True)
                    # Mark as fail for this space
                    space_summary[space_name]["fail"] += 1
                    continue  # Skip to the next file
                except Exception as ex:
                    error_msg = (
                        f"Unexpected error occurred while uploading {file_path}: {ex}"
                    )
                    print(error_msg)
                    logging.error(error_msg, exc_info=True)
                    # Mark as fail for this space
                    space_summary[space_name]["fail"] += 1
                    continue  # Skip to the next file

                if file_id:
                    success_msg = f"  Uploaded successfully with file_id: {file_id}"
                    print(success_msg)
                    logging.info(success_msg)
                else:
                    warn_msg = f"  Could not get file_id from response: {upload_response}"
                    print(warn_msg)
                    logging.warning(warn_msg)
                    # Mark as fail for this space since we have no file_id
                    space_summary[space_name]["fail"] += 1
                    continue

                # Attempt to add the uploaded file to the chosen knowledgebase
                try:
                    adding_msg = f"  Adding file {file_id} to knowledge {target_knowledge_id}..."
                    print(adding_msg)
                    logging.info(adding_msg)

                    add_response = add_file_to_knowledge(token, target_knowledge_id, file_id, base_url)

                    added_msg = f"  Added to knowledge successfully. Response: {add_response}"
                    print(added_msg)
                    logging.info(added_msg)
                except requests.exceptions.HTTPError as http_err:
                    error_msg = (
                        f"HTTP error occurred while adding file {file_id} to knowledge "
                        f"{target_knowledge_id}: {http_err}"
                    )
                    print(error_msg)
                    logging.error(error_msg, exc_info=True)
                    space_summary[space_name]["fail"] += 1
                    continue
                except Exception as ex:
                    error_msg = (
                        f"Unexpected error occurred while adding file {file_id} to knowledge "
                        f"{target_knowledge_id}: {ex}"
                    )
                    print(error_msg)
                    logging.error(error_msg, exc_info=True)
                    space_summary[space_name]["fail"] += 1
                    continue

                # If we made it this far, the file was successfully uploaded and added
                if target_count_var == "A":
                    uploaded_count_a += 1
                else:
                    uploaded_count_b += 1

                space_summary[space_name]["success"] += 1

    print("Finished uploading.")
    print(f"Uploaded {uploaded_count_a} files to knowledge A (ID: {knowledge_id_a})")
    print(f"Uploaded {uploaded_count_b} files to knowledge B (ID: {knowledge_id_b})")
    logging.info(
        f"Finished uploading. Uploaded {uploaded_count_a} files to knowledge A and "
        f"{uploaded_count_b} files to knowledge B."
    )

    # Print and log a summary for each space:
    print("\nPer-space summary:")
    logging.info("Per-space summary:")
    for space, stats in space_summary.items():
        processed = stats["processed"]
        success = stats["success"]
        fail = stats["fail"]
        msg_summary = (
            f"Space '{space}' - Processed: {processed}, Successful: {success}, Failed: {fail}"
        )
        print(msg_summary)
        logging.info(msg_summary)

if __name__ == "__main__":
    main()
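
The script imports `config` from a `config.py` module that is not shown. A minimal sketch of that file, using only the keys listed in the docstring above. All values are placeholders to replace with your own:

```python
# config.py (sketch): the upload script reads these keys via `from config import config`.
config = {
    # Root directory that is walked for .txt files.
    "output_text_base_dir": "text_output",
    # Base URL of the Open WebUI instance, without a trailing slash.
    "openwebui_url": "http://localhost:3000",
    # API token sent as "Authorization: Bearer <token>".
    "openwebui_token": "YOUR_OPENWEBUI_TOKEN_HERE",
    # Knowledge base for everything outside the special folder.
    "openwebui_knowledge_id_a": "KNOWLEDGE_COLLECTION_ID_A",
    # Knowledge base for files under the special folder.
    "openwebui_knowledge_id_b": "KNOWLEDGE_COLLECTION_ID_B",
    # Folder name that routes files to knowledge base B.
    "special_folder": "5027",
}
```

With this file next to the script, `python upload_script.py --only-special` would upload only the files under the `5027` folder.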

@Constey commented on GitHub (Feb 7, 2025): > > From my thought i can just call the api with the same file again to do a re-upload. it seems the upload of a file works, but adding the file to the knowledge brings the 400 bad request issue. running on: (v0.4.8 [(latest)](https://github.com/open-webui/open-webui/releases/tag/v0.4.8) > > Steps to Reproduce: > > > > * Upload a file > > * add the file to the knowledgebase > > * Upload the file again > > * Try adding it to the knowledgebase again > > Uploaded successfully with file_id: 967dd000-429c-46bb-9931-f352364dd746 > > Adding file 967dd000-429c-46bb-9931-f352364dd746 to knowledge def088e5-a452-4dd9-b67d-4de942f3785b... > > requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://xxx/api/v1/knowledge/def088e5-a452-4dd9-b67d-4de942f3785b/file/add > > > > My test script for upload: def add_file_to_knowledge(token, knowledge_id, file_id, base_url): url = f'{base_url}/api/v1/knowledge/{knowledge_id}/file/add' headers = { 'Authorization': f'Bearer {token}', 'Content-Type': 'application/json' } data = {'file_id': file_id} response = requests.post(url, headers=headers, json=data) response.raise_for_status() return response.json() > > can you share the final script used to upload and add files to the knowledge base please. like this, we put the files in separate kb's so its a bit more complex, but should show how it works: ``` """ This script uploads .txt files from a directory structure into two different knowledgebases based on a "special_folder" configuration. If a folder matches the special_folder, the file is uploaded to knowledgebase B, otherwise to knowledgebase A. New Feature (version 1.4.0): A command-line argument `--only-special` has been added. When used, only files within the special_folder subdirectories will be uploaded. All other files/folders will be skipped. 
Exception handling has been added to ensure that if an error occurs (like a HTTPError from the server), the script logs the error to upload_webui.log and continues processing the remaining files. A per-space summary is included to show how many files were processed, how many were successfully uploaded, and how many failed to upload for each top-level folder (space). Prerequisites: pip install requests Usage: python upload_script.py [--only-special] --only-special Only upload files that belong to the special folder (e.g., "5027"). Configuration: - Update config.py to include: { "output_text_base_dir": "text_output", "openwebui_url": "http://localhost:3000", "openwebui_token": "YOUR_OPENWEBUI_TOKEN_HERE", "openwebui_knowledge_id_a": "KNOWLEDGE_COLLECTION_ID_A", "openwebui_knowledge_id_b": "KNOWLEDGE_COLLECTION_ID_B", "special_folder": "5027" } """ import os import sys import argparse import requests import logging from config import config # Configure logging to a file named "upload_webui.log". # INFO level logs general operation flow, WARNING for unusual states, ERROR for exceptions. logging.basicConfig( filename='upload_webui.log', level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s' ) def upload_file(token, file_path, base_url): """ Uploads a single file to the web UI. Returns the JSON response from the server. Raises requests.exceptions.HTTPError if the server responds with an error status code. :param token: Auth token. :param file_path: Path to the file being uploaded. :param base_url: Base URL for the OpenWebUI server. :return: JSON response from the server with the 'id' of the uploaded file. 
""" url = f'{base_url}/api/v1/files/' headers = { 'Authorization': f'Bearer {token}', 'Accept': 'application/json' } with open(file_path, 'rb') as f: files = {'file': f} response = requests.post(url, headers=headers, files=files) response.raise_for_status() # Will raise HTTPError for 4xx/5xx responses return response.json() def add_file_to_knowledge(token, knowledge_id, file_id, base_url): """ Adds an uploaded file to a specified knowledgebase. :param token: Auth token. :param knowledge_id: ID of the target knowledgebase. :param file_id: ID of the file to be added. :param base_url: Base URL for the OpenWebUI server. :return: JSON response from the server. """ url = f'{base_url}/api/v1/knowledge/{knowledge_id}/file/add' headers = { 'Authorization': f'Bearer {token}', 'Content-Type': 'application/json' } data = {'file_id': file_id} response = requests.post(url, headers=headers, json=data) response.raise_for_status() # Will raise HTTPError for 4xx/5xx responses return response.json() def main(): """ Main logic to: 1. Parse command-line arguments. 2. Traverse text_base_dir for .txt files. 3. If --only-special is set, only process files in the special_folder. 4. Decide which knowledgebase to upload the file to based on 'special_folder'. 5. Attempt file upload and knowledgebase addition, handle errors by logging and continuing. 6. Keep track of per-space metrics (processed, success, fail). 7. Print and log summary metrics for each space at the end of the script. """ parser = argparse.ArgumentParser(description="Upload text files to knowledgebase(s).") parser.add_argument( "--only-special", action="store_true", help="Only upload files in the special_folder (skip others)." 
) args = parser.parse_args() only_special = args.only_special text_base_dir = config["output_text_base_dir"] token = config["openwebui_token"] base_url = config["openwebui_url"] knowledge_id_a = config["openwebui_knowledge_id_a"] knowledge_id_b = config["openwebui_knowledge_id_b"] special_folder = config["special_folder"] if not os.path.exists(text_base_dir): msg = f"Text directory '{text_base_dir}' does not exist. Nothing to upload." print(msg) logging.warning(msg) return # Counters for knowledgebase A and B uploaded_count_a = 0 uploaded_count_b = 0 # Dictionary to track counts per space: { 'SPACEKEY': {'processed': 0, 'success': 0, 'fail': 0}, ... } space_summary = {} for root, dirs, files in os.walk(text_base_dir): # Determine the relative path from text_base_dir relative_path = os.path.relpath(root, text_base_dir) current_folders = relative_path.split(os.sep) # Identify the top-level space name (if we're not at the root of text_base_dir) if relative_path == ".": space_name = "ROOT" else: space_name = current_folders[0] # Ensure this space_name is in our space_summary dictionary if space_name not in space_summary: space_summary[space_name] = {"processed": 0, "success": 0, "fail": 0} # If only_special is set, skip this entire path unless it has special_folder if only_special and special_folder not in current_folders: continue # Determine the target knowledgebase (A or B) if special_folder in current_folders: target_knowledge_id = knowledge_id_b target_count_var = "B" else: target_knowledge_id = knowledge_id_a target_count_var = "A" for file in files: if file.endswith(".txt"): # We have a text file; increment the processed counter for this space space_summary[space_name]["processed"] += 1 file_path = os.path.join(root, file) message = f"Uploading file: {file_path} to knowledgebase {target_knowledge_id}" print(message) logging.info(message) # Attempt the upload try: upload_response = upload_file(token, file_path, base_url) file_id = upload_response.get("id") except 
requests.exceptions.HTTPError as http_err: error_msg = ( f"HTTP error occurred while uploading {file_path}: {http_err}" ) print(error_msg) logging.error(error_msg, exc_info=True) # Mark as fail for this space space_summary[space_name]["fail"] += 1 continue # Skip to the next file except Exception as ex: error_msg = ( f"Unexpected error occurred while uploading {file_path}: {ex}" ) print(error_msg) logging.error(error_msg, exc_info=True) # Mark as fail for this space space_summary[space_name]["fail"] += 1 continue # Skip to the next file if file_id: success_msg = f" Uploaded successfully with file_id: {file_id}" print(success_msg) logging.info(success_msg) else: warn_msg = f" Could not get file_id from response: {upload_response}" print(warn_msg) logging.warning(warn_msg) # Mark as fail for this space since we have no file_id space_summary[space_name]["fail"] += 1 continue # Attempt to add the uploaded file to the chosen knowledgebase try: adding_msg = f" Adding file {file_id} to knowledge {target_knowledge_id}..." print(adding_msg) logging.info(adding_msg) add_response = add_file_to_knowledge(token, target_knowledge_id, file_id, base_url) added_msg = f" Added to knowledge successfully. 
Response: {add_response}" print(added_msg) logging.info(added_msg) except requests.exceptions.HTTPError as http_err: error_msg = ( f"HTTP error occurred while adding file {file_id} to knowledge " f"{target_knowledge_id}: {http_err}" ) print(error_msg) logging.error(error_msg, exc_info=True) space_summary[space_name]["fail"] += 1 continue except Exception as ex: error_msg = ( f"Unexpected error occurred while adding file {file_id} to knowledge " f"{target_knowledge_id}: {ex}" ) print(error_msg) logging.error(error_msg, exc_info=True) space_summary[space_name]["fail"] += 1 continue # If we made it this far, the file was successfully uploaded and added if target_count_var == "A": uploaded_count_a += 1 else: uploaded_count_b += 1 space_summary[space_name]["success"] += 1 print("Finished uploading.") print(f"Uploaded {uploaded_count_a} files to knowledge A (ID: {knowledge_id_a})") print(f"Uploaded {uploaded_count_b} files to knowledge B (ID: {knowledge_id_b})") logging.info( f"Finished uploading. Uploaded {uploaded_count_a} files to knowledge A and " f"{uploaded_count_b} files to knowledge B." ) # Print and log a summary for each space: print("\nPer-space summary:") logging.info("Per-space summary:") for space, stats in space_summary.items(): processed = stats["processed"] success = stats["success"] fail = stats["fail"] msg_summary = ( f"Space '{space}' - Processed: {processed}, Successful: {success}, Failed: {fail}" ) print(msg_summary) logging.info(msg_summary) if __name__ == "__main__": main() ```

@Jeevanhm commented on GitHub (Feb 7, 2025):

Thank you, it works like a charm!

@Jeevanhm commented on GitHub (Feb 9, 2025):

@Constey how do you manage the files on the server?

I'm unable to delete a specific file by its file ID, but when I call the endpoint with "all", all files on the server are deleted.

curl -X 'DELETE' 'http://192.xx.xx.xx:8080/api/v1/files/2bc4340d-4b70-477d-b621-714b854c9817' -H 'accept: application/json'

curl -X 'DELETE' 'http://192.xx.xx.xx:8080/api/v1/files/all' -H 'accept: application/json'
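The single-file call above can also be scripted. This is a minimal sketch using only the Python standard library. It assumes the instance expects the same `Authorization: Bearer` header that the upload script earlier in this thread sends; the curl commands above do not send one, and whether that is related to the failure is not confirmed here. `build_delete_request` and `delete_file` are hypothetical helper names, and the base URL, file ID, and token are placeholders.

```python
import json
import urllib.request


def build_delete_request(base_url, file_id, token):
    """Build (but do not send) a DELETE request for a single file ID."""
    url = f"{base_url.rstrip('/')}/api/v1/files/{file_id}"
    headers = {
        "Accept": "application/json",
        # Assumption: the server expects a bearer token, as in the upload script.
        "Authorization": f"Bearer {token}",
    }
    return urllib.request.Request(url, headers=headers, method="DELETE")


def delete_file(base_url, file_id, token):
    """Send the DELETE request and return the decoded JSON body, if any.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses, so a
    rejected delete shows up as an exception rather than passing silently.
    """
    request = build_delete_request(base_url, file_id, token)
    with urllib.request.urlopen(request) as response:
        body = response.read()
        return json.loads(body.decode("utf-8")) if body else None
```

Keeping the request construction separate from the network call makes it easy to print or log the exact URL and headers before anything is deleted.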

@Constey commented on GitHub (Feb 9, 2025):

My initial plan was to just overwrite the files, but since that did not work, I have for now created new knowledge bases and relinked them to the model (and deleted the old knowledge bases manually). I still have to test what the current behaviour is (I guess it's not fixed), but if you find a way to delete the old files, that would be nice.

@ozp commented on GitHub (Feb 10, 2025):

Hello,

I uploaded the OpenWebUI docs to the RAG system following the documentation instructions. This allows me to ask the chat questions about OpenWebUI.

However, I noticed poor performance with the default RAG settings. So, I created a new configuration that requires deleting all previously uploaded files and re-uploading them.

This is when I encountered the duplication issue.

What I’ve Tried So Far:

Deleted files from the directory (uploads, cache, vector_db, etc.).
Deleted data from the database (with the help of GPT, Claude, DeepSeek, and others).

Here are some of the SQL commands I attempted:

DELETE FROM embedding_metadata;
DELETE FROM embeddings;
DELETE FROM segment_metadata;
DELETE FROM segments;
DELETE FROM collections;
VACUUM;

Modified .md and .mdx files by adding an extra line, yet they were still detected as duplicates.

Request:

Could you provide a step-by-step guide on how to completely remove all previously indexed files from the knowledge base? I’d really appreciate it.
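Before and after running deletes like the ones listed above, a read-only row count per table can show what is actually left. The sketch below assumes the statements target a SQLite file (the table names and `VACUUM` suggest this, but the thread does not say so explicitly). `table_row_counts` is a hypothetical helper, not part of Open WebUI, and it should be pointed at a backup copy of the database rather than the live file.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table_name: row_count} for every user table in the database."""
    cursor = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )
    counts = {}
    for (name,) in cursor.fetchall():
        # Names come from sqlite_master; quote them as identifiers to be safe.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts


# Example usage (the path is a placeholder for a copy of the real file):
#     conn = sqlite3.connect("copy_of_vector_db.sqlite3")
#     for table, rows in table_row_counts(conn).items():
#         print(f"{table}: {rows}")
```

This only inspects the one database file it is given. Whether other stores also keep records of uploaded files is not established in this thread.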

@gilbrotheraway commented on GitHub (Mar 28, 2025):

In 24 hours, with fewer than 5 knowledge bases, my vector-db folder has:

Total disk usage: 7.7 GiB Apparent size: 7.6 GiB Items: 9881

And it's not even user error: it's because uploads fail when uploading many files (GitHub .md docs).

@Constey commented on GitHub (Apr 1, 2025):

I think this issue still exists.
If I have a knowledge base that blows up my vector DB to 10 GB and I remove the whole knowledge base, the space used by the vector DB is not freed up.
/var/lib/docker/volumes/open-webui/_data/vector_db
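To check whether removing a knowledge base frees any space, the folder can be measured before and after the removal. This is a minimal sketch; `directory_usage` is a hypothetical helper, and the path in the usage comment is the one quoted above and may differ on other installations.

```python
import os


def directory_usage(path):
    """Return (total_bytes, item_count) for everything below path."""
    total_bytes = 0
    item_count = 0
    for root, dirs, files in os.walk(path):
        # Count both sub-directories and files as items.
        item_count += len(dirs) + len(files)
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so that link targets are not counted.
            if not os.path.islink(file_path):
                total_bytes += os.path.getsize(file_path)
    return total_bytes, item_count


# Example usage:
#     size, items = directory_usage(
#         "/var/lib/docker/volumes/open-webui/_data/vector_db"
#     )
#     print(f"{size / 2**30:.1f} GiB in {items} items")
```

The byte total is the apparent size (sum of file lengths), so it can differ slightly from the on-disk usage reported by tools such as `du`.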

@Classic298 commented on GitHub (Apr 1, 2025):

Yes, according to Tim, this is intentional :/
