[PR #23039] [CLOSED] fix: remove null bytes from PDF metadata to prevent PostgreSQL JSONB errors #50036

Closed
opened 2026-04-30 02:31:43 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/23039
Author: @yang1002378395-cmyk
Created: 3/25/2026
Status: Closed

Base: dev ← Head: fix-pdf-null-byte-22992


📝 Commits (1)

  • 697ab41 fix: remove null bytes from PDF metadata to prevent PostgreSQL JSONB errors

📊 Changes

1 file changed (+3 additions, -0 deletions)

📝 backend/open_webui/retrieval/vector/utils.py (+3 -0)

📄 Description

Pull Request Checklist

  • Target branch: This PR targets the dev branch
  • Description: Fix for Issue #22992 - PDF upload fails with PostgreSQL backend
  • Testing: Tested locally with null byte removal logic

Changelog Entry

Description

PostgreSQL JSONB cannot store null bytes (\x00) in strings. Some PDF metadata contains them (e.g., "Adobe PSL 1.3e for Canon\x00"), which causes DataError: unsupported Unicode escape sequence when document chunks are inserted.

Fixed

  • Added null byte filtering to the process_metadata function
  • String values now have control characters (ord < 32) removed, except newlines (\n), carriage returns (\r), and tabs (\t)
  • Metadata strings are therefore safe to store in PostgreSQL JSONB
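
The filtering rule listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the helper name `strip_control_chars` is hypothetical, and how it is wired into `process_metadata` is not shown here.

```python
# Hypothetical helper (name is an assumption, not taken from the PR diff).
# Drops characters with code points below 32, keeping newline, carriage
# return, and tab, as described in the "Fixed" list.
ALLOWED_CONTROL_CHARS = {"\n", "\r", "\t"}


def strip_control_chars(value: str) -> str:
    return "".join(
        ch for ch in value if ord(ch) >= 32 or ch in ALLOWED_CONTROL_CHARS
    )


cleaned = strip_control_chars("Adobe PSL 1.3e for Canon\x00")
print(repr(cleaned))
```

A character-level filter like this leaves ordinary text, including non-ASCII characters, untouched, since only code points below 32 are considered.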

Root Cause

The process_metadata function in backend/open_webui/retrieval/vector/utils.py converts non-serializable types to strings but does not sanitize string values. When PDF metadata contains null bytes, PostgreSQL raises an error during INSERT.
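
As a rough illustration of why a null byte shows up as a Unicode escape error, the snippet below serializes the example metadata with Python's standard `json` module. It assumes the metadata reaches the database as JSON text; the exact serialization path used by the backend is not shown in this PR.

```python
import json

# Example value taken from the description above; "\x00" is a real NUL character.
metadata = {"producer": "Adobe PSL 1.3e for Canon\x00"}

# json.dumps escapes the NUL character as the six-character sequence \u0000.
encoded = json.dumps(metadata)
print(encoded)

# The escape is present in the serialized text that a JSONB column would receive
# (assuming JSON text is what gets sent to the database).
has_nul_escape = "\\u0000" in encoded
```

The serialized text is valid JSON, so the failure only appears once PostgreSQL tries to convert the `\u0000` escape into its internal text representation.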

Files Changed

  • backend/open_webui/retrieval/vector/utils.py: Added string sanitization to remove null bytes

Testing

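from open_webui.retrieval.vector.utils import process_metadata

# Before fix: the value ends with a real null byte ("\x00", not the literal text "\\x00")
metadata = {"producer": "Adobe PSL 1.3e for Canon\x00"}
# INSERT fails with: DataError: unsupported Unicode escape sequence

# After fix
result = process_metadata(metadata)
# result["producer"] == "Adobe PSL 1.3e for Canon" (null byte removed)
# INSERT succeeds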

Fixes #22992


Contributor License Agreement


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-30 02:31:43 -05:00
Reference: github-starred/open-webui#50036