[PR #23040] [CLOSED] fix: remove null bytes from metadata to prevent PostgreSQL JSONB errors #65845

Closed
opened 2026-05-06 11:50:48 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/23040
Author: @yang1002378395-cmyk
Created: 3/25/2026
Status: Closed

Base: dev ← Head: fix-pdf-null-byte-v2


📝 Commits (1)

  • 7972e61 fix: remove null bytes from metadata to prevent PostgreSQL JSONB errors

📊 Changes

1 file changed (+25 additions, -5 deletions)


📝 backend/open_webui/retrieval/vector/utils.py (+25 -5)

📄 Description

Summary

Fixes #22992

Removes null bytes and invalid control characters from metadata strings before they are stored in the vector database. This prevents PostgreSQL JSONB errors when processing PDFs with malformed metadata.
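
For context, a minimal sketch of why a stray null byte is a problem, using only Python's standard `json` module. When metadata containing `\x00` is serialized to JSON, the null byte becomes the escape sequence `\u0000`, which PostgreSQL's `jsonb` type rejects. The variable names here are illustrative and not taken from the PR.

```python
import json

# Metadata as a PDF parser might return it, with a trailing null byte.
metadata = {"producer": "Adobe PDF Library 15.0\x00"}

# json.dumps escapes the null byte as the six-character sequence \u0000.
payload = json.dumps(metadata)
has_null_escape = "\\u0000" in payload

print(payload)
print(has_null_escape)
```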

Changes

  • Added _clean_string_value() function to remove null bytes and invalid control characters
  • Added _clean_value() for recursive cleaning of nested structures (dict, list)
  • Updated filter_metadata() and process_metadata() to clean all string values

Testing

# Test with a null byte in the PDF metadata
from open_webui.retrieval.vector.utils import filter_metadata

test_metadata = {'producer': 'Adobe PDF Library 15.0\x00'}
result = filter_metadata(test_metadata)
assert result['producer'] == 'Adobe PDF Library 15.0'  # null byte removed

All unit tests passed.

Issue Reference

Closes #22992


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-06 11:50:48 -05:00

Reference: github-starred/open-webui#65845