Mirror of https://github.com/open-webui/open-webui.git (synced 2026-05-06 02:48:13 -05:00)
[GH-ISSUE #15616] issue: PostgreSQL null bytes #33147
Originally created by @perelin on GitHub (Jul 9, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/15616
Originally assigned to: @Classic298 on GitHub.
Check Existing Issues
Installation Method
Docker
Open WebUI Version
v0.6.15
Ollama Version (if applicable)
No response
Operating System
macOS
Browser (if applicable)
Chrome
Confirmation
I have read and followed all the instructions in README.md.
Disclaimer
Bug report writeup supported by AI
Expected Behavior
Chat search functionality should work seamlessly when users search through their chat messages via the /api/v1/chats/search endpoint. The search (chats.py:160-186) should process JSON data in the chat column without errors and return relevant chat results.
Actual Behavior
Chat search returns an HTTP 500 error with psycopg2.errors.UntranslatableCharacter: unsupported Unicode escape sequence when PostgreSQL encounters null bytes (\u0000) in JSON chat data. The error specifically occurs during PostgreSQL JSON processing (chats.py:671-680).
Steps to Reproduce
Detailed Steps:
1. Create a chat containing null bytes (\u0000) in the message content.
2. Search via /api/v1/chats/search?text=<search_term>&page=1.
3. The request invokes the get_chats_by_user_id_and_search_text() method (chats.py:572-723), which fails with the error above.
Logs & Screenshots
Docker Logs
Browser console Logs
Additional Information
Root Cause:
The issue stems from the database migration (242a2047eae0_update_chat_table.py:64-74) that converted the chat column from Text to JSON format. During this migration, text data containing null bytes was converted to JSON, but PostgreSQL's JSON functions cannot handle these characters when performing searches.
Impact:
Proposed Solutions:
@Ithanil commented on GitHub (Jul 10, 2025):
Same issue/discussion: https://github.com/open-webui/open-webui/discussions/9318
Should be fixed by: https://github.com/open-webui/open-webui/pull/14957
@perelin commented on GitHub (Jul 21, 2025):
@Ithanil thanks! Is there a release scheduled for the fix? I have not seen it in .17 and .18.
@Classic298 commented on GitHub (Jul 21, 2025):
@perelin fix was included in 0.6.16
it's just missing from the changelog is all, but the fix is implemented
And for me, I can confirm it works :D
@perelin commented on GitHub (Jul 21, 2025):
I'm on 0.6.16 and still get the errors on search. Any way I can support debugging this? A search from just now:
@rgaricano commented on GitHub (Jul 21, 2025):
I have a similar issue because I have an ∀ (\u2200) in my site name (with PostgreSQL).
A workaround is to filter & replace in the SQL clause, but I haven't gotten around to it yet.
@Classic298 commented on GitHub (Jul 21, 2025):
@perelin I think, based on your traceback, the following might be happening:
My PR #14957 only prevents new null characters from being introduced and from being searched for, but you have existing corrupted JSON data in your PostgreSQL database from before the fix. When PostgreSQL tries to parse these old records during search, it still fails with the same error.
Looking at your log:
CONTEXT: JSON data, line 1: ....g., classification models). Some early \u0000...
This shows PostgreSQL hitting existing corrupted chat records that contain \u0000 characters from before my sanitization was implemented.
To fix this, you need to clean your existing data. Try letting AI build you a very simple script that connects to your database and replaces all null bytes with nothing in all rows of the chat table.
My fix prevents future corruption and incorrect searches but doesn't (and can't!) retroactively clean existing data (would require database manipulation of every single Open WebUI instance). Can you try to build (e.g. with AI) and run a cleanup query and see if that resolves your search functionality?
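The cleanup Classic298 describes can be sketched with a small stdlib-only helper. The strip_nulls function below is illustrative, not code from the project; wiring it to a real database (SELECT each chat row, sanitize the decoded JSON, UPDATE rows that changed) is left out, and you should back up the database before running anything like it.

```python
import json

def strip_nulls(value):
    """Recursively remove NUL characters (\\u0000) from every string
    inside a decoded JSON structure (dicts, lists, plain strings)."""
    if isinstance(value, str):
        return value.replace("\x00", "")
    if isinstance(value, list):
        return [strip_nulls(item) for item in value]
    if isinstance(value, dict):
        return {strip_nulls(k): strip_nulls(v) for k, v in value.items()}
    return value  # numbers, booleans, None pass through unchanged

# Example: a chat payload whose JSON carries a \u0000 escape, which
# decodes to a real NUL character inside the Python string.
corrupted = json.loads('{"messages": [{"content": "hello\\u0000world"}]}')
cleaned = strip_nulls(corrupted)
assert json.dumps(cleaned) == '{"messages": [{"content": "helloworld"}]}'
```

Because the walk operates on decoded strings, it removes only actual NUL characters and leaves legitimate text that merely discusses "\u0000" as a sequence untouched.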
@perelin commented on GitHub (Jul 22, 2025):
@Classic298 alright, thanks! I did a manual cleanup. It was easy since it was only one entry.
For others with the same problem:
@rgaricano commented on GitHub (Jul 22, 2025):
It's OK, but deleting the whole chat is too radical; just replacing the character should be enough!
@Classic298 commented on GitHub (Jul 22, 2025):
Yes, replacing the characters would have been enough.
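The replace-instead-of-delete idea can be expressed as a single in-database UPDATE. This is a hypothetical statement, not from the thread: the table and column names (chat.chat) come from the migration mentioned earlier, so verify them against your schema and back up the database first.

```python
# Hypothetical cleanup SQL. The json column is cast to text, where a
# stored NUL appears as the six-character escape sequence \u0000; that
# sequence is stripped and the result is cast back to json.
CLEANUP_SQL = r"""
UPDATE chat
SET chat = replace(chat::text, '\u0000', '')::json
WHERE strpos(chat::text, '\u0000') > 0;
"""

# No actual NUL byte appears in the statement itself - only the literal
# backslash-u-0000 sequence that PostgreSQL's text cast produces.
assert "\x00" not in CLEANUP_SQL
```

Caveat: as mramendi found later in the thread, blunt text-level replacement can also mangle legitimate content that merely contains the characters "\u0000" (serialized as \\u0000); a sanitizer that walks the decoded JSON and strips only real NUL characters avoids that side effect.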
@mramendi commented on GitHub (Sep 22, 2025):
I am hitting the problem on 0.6.30, and I started this setup on 0.6.28 so no data from older versions could have been there. I don't know which message could even contain null characters, but there are many Cyrillic messages.
Full excerpt from container log here https://pastes.io/openwebui-search-errpr
Key lines:
@mramendi commented on GitHub (Sep 22, 2025):
Forensics and then an attempt at recovery, all guided by the inimitable Kimi K2, which remembered a similar bug from 0.3.x first - but that was embeddings then, full chat texts now.
This was the forensics, and while some of the threads did bug out, the others are perfectly normal; the last one is still ongoing, and that's where Kimi came up with these commands. So I tried its suggested patch attempts:
At this point Kimi thought the JSON itself was broken and suggested a Python script, and we had quite a few turns getting it to do something, as attempts to just strip null bytes seemed not to work. What finally worked was this overkill that also removed valid discussion of \u0000:
At that point things got fixed, and Kimi suggested a trigger to prevent future additions, but the first version of the trigger stopped all additions. Kimi suggested that something was raising exceptions within the trigger, so it rewrote it to never raise, but now I don't know if it even does anything.
For now I have a working system thanks to a sassy and cavalier model (cavalier indeed: it never once suggested a backup of the database, and its script ideas involved running the regex on serialized JSON - the recursive idea is actually from Qwen3 235B A22 Instruct, but its own script was over 100 lines and I think it had other bugs). But I do hope this can be fixed. What happened here is that most of the listed threads bugged out, with the model aborting or me pressing Stop - that probably generates a null character and/or some other control characters. The string needs to be sanitized for those before being written to PostgreSQL, but hopefully in a way that does not kill mere mentions of \u0000 as a sequence.
@Ithanil commented on GitHub (Sep 22, 2025):
That's all overly complicated, check the sanitize option of my cleanup script: https://github.com/Ithanil/openwebui-scripts/blob/update/scripts/cleanup_pg/cleanup_pg.py (but note that the whole script is meant to delete old chats and orphaned data).
Nevertheless, unfortunately it indeed still happens that \u0000 characters make it into chats somehow @Classic298
@Classic298 commented on GitHub (Sep 22, 2025):
Thanks, I'll reopen the issue, and when I have time I'll look into it; maybe there are other points of entry for data I did not think about.
@mramendi commented on GitHub (Sep 23, 2025):
I removed the trigger because I was getting a hang that I am... still getting, so maybe the trigger was not the issue. (The hang is unrelated - streaming from a model seems to stop; I switch to another chat and back to this one and see the completed result. I'll try to collect enough info for its own bug.)
So my search stopped again and I went looking for the offending row. As far as I could work out from the dump, all the messages where this is happening have the role "assistant" but don't contain an assistant response; instead they contain a tool response. The tool in question is a web page scraper (I think it's the built-in one). So the tool response contains null characters, and these get written to the database. EDIT: I now realize what happened is the tool pulled raw PDF files; while I'll fix my search pathway to either avoid or decode PDFs, the pgsql thing still needs fixing.
I hope this helps pinpoint/fix the issue.
@Classic298 commented on GitHub (Sep 23, 2025):
interesting. Tool use and document ingestion.
@rgaricano commented on GitHub (Sep 23, 2025):
And what about this solution to prevent this kind of issue and gain security?
(As prevention: always declare the correct client encoding (SET client_encoding TO 'UTF8'; or use the right psql -E/driver settings) and store binary data in bytea, not text/json.)
Answer
You're asking for the specific code implementation to set PostgreSQL client encoding to UTF-8 and properly handle binary data storage in Open WebUI's database connections.
Main Database Connection Encoding
For the main database connection in backend/open_webui/internal/db.py, you need to modify the PostgreSQL engine creation to include client encoding [1]:
Environment Variable Configuration
You can also set the encoding through the DATABASE_URL environment variable [2]:
Connection Event Handler for Encoding
Add an event listener to ensure encoding is set on every connection:
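The elided snippets presumably resembled the following sketch using SQLAlchemy's connect event; the URL and engine setup are placeholders, not Open WebUI's actual configuration in backend/open_webui/internal/db.py.

```python
from sqlalchemy import create_engine, event

# Placeholder URL - the real one comes from the DATABASE_URL environment
# variable (which can also carry the setting directly, e.g.
# postgresql://user:pass@host:5432/db?client_encoding=utf8).
engine = create_engine(
    "postgresql://user:password@localhost:5432/openwebui",
    # Ask libpq to set the encoding for every connection it opens:
    connect_args={"options": "-c client_encoding=utf8"},
)

# Belt and braces: also set it explicitly on each new DBAPI connection.
@event.listens_for(engine, "connect")
def set_client_encoding(dbapi_connection, connection_record):
    cursor = dbapi_connection.cursor()
    cursor.execute("SET client_encoding TO 'UTF8'")
    cursor.close()
```

Note, though, that client encoding alone does not make NUL bytes storable: PostgreSQL's text and json types reject \u0000 regardless of encoding, so sanitization is still needed.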
Binary Data Storage Implementation
For binary data that shouldn't be stored as JSON/text, you can create a new column type. Here's an example for storing encrypted chat data:
Migration for Binary Storage
Create a migration to add binary storage capability:
Utility Functions for Binary Handling
Add utility functions to handle binary data conversion:
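The stripped code block under this heading might have resembled the following stdlib-only sketch. The function names are made up for illustration; the point is that bytea stores raw bytes, so NUL bytes survive, and base64 bridges binary data into JSON-safe text.

```python
import base64

def text_to_bytea(text: str) -> bytes:
    """Encode arbitrary text (including NUL characters) as UTF-8 bytes
    suitable for a PostgreSQL bytea column."""
    return text.encode("utf-8")

def bytea_to_text(data: bytes) -> str:
    """Decode bytea content back into a Python string."""
    return data.decode("utf-8")

def bytea_to_json_safe(data: bytes) -> str:
    """Base64-encode raw bytes so they can travel inside JSON, which
    cannot carry NUL bytes or arbitrary binary."""
    return base64.b64encode(data).decode("ascii")

def json_safe_to_bytea(encoded: str) -> bytes:
    """Reverse of bytea_to_json_safe."""
    return base64.b64decode(encoded)

# Round-trip demonstration: NUL bytes survive bytea-style storage
# intact, which is why bytea (unlike text/json) suits binary payloads.
raw = "pdf\x00payload"
assert bytea_to_text(text_to_bytea(raw)) == raw
assert json_safe_to_bytea(bytea_to_json_safe(text_to_bytea(raw))) == raw.encode("utf-8")
```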
Notes
This implementation provides multiple layers of protection: client encoding ensures proper UTF-8 handling at the connection level, while binary storage with bytea provides a secure way to store sensitive data that might contain problematic characters. The existing JSON storage [3] can be maintained for compatibility while adding binary storage for enhanced security.
Citations
File: backend/open_webui/internal/db.py (L128-145)
File: backend/open_webui/env.py (L272-279)
File: backend/open_webui/models/chats.py (L32-32)
@Classic298 commented on GitHub (Sep 23, 2025):
should we add this to the docs?
@rgaricano commented on GitHub (Sep 23, 2025):
I think that without encoding the situation is the same; e.g. the DB can be set up for UTF-8 but I'm using Latin-1 in my chats, introducing "UntranslatableCharacter" errors.
@Classic298 commented on GitHub (Sep 24, 2025):
https://github.com/open-webui/open-webui/pull/17723
@Classic298 commented on GitHub (Sep 24, 2025):
Should be a more central and all-round solution.
Of course it doesn't work for already existing data, which needs to be manually cleaned once via a script.
@rgaricano commented on GitHub (Sep 24, 2025):
I don't think that sanitizing the JSON is the best option. Yes, it can solve the problem, and it may be useful for titles and other AI-generated text, but it can change the sense of the information in other data provided.
The de/encryption of JSON strings & binary storage solution seems most robust for preserving data.
@Classic298 commented on GitHub (Sep 25, 2025):
Hm, how could it change the sense of the information?
I am only substituting null characters.
@Classic298 commented on GitHub (Sep 26, 2025):
https://github.com/open-webui/open-webui/pull/17775
@mramendi commented on GitHub (Sep 27, 2025):
I suspect that the currently proposed approach might be sanitizing things a bit too late. The null bytes appear, in my experience, either because of nonstandard tool responses (PDF file stuck in without even bothering with base64) or broken LLM responses (inference server weirds out and responds with a null). These should be caught before they are inducted into the context.
@Classic298 commented on GitHub (Sep 28, 2025):
@mramendi while you're right in principle that data should be sanitized as soon as possible, it is easier to implement at the level proposed in #17775, and it is sufficient to fix the PostgreSQL errors, because the errors stem from the search; the data does get sanitized before being inserted into the database, ensuring these errors should not appear anymore and the search works again.
It will not make a difference in the end, as long as the data is sanitized before sending it to the database, and I am not sure the additional complexity of sanitizing data earlier than this solution would be beneficial.
@abiari commented on GitHub (Oct 8, 2025):
@Classic298 I went through all the issues about this subject, did my own tests, and found out that sanitizing "\x00" doesn't take care of '\u0000' (which gave me the worst headache trying to search my pg db). They might be identical in value, but when running a query to search the chat JSON, '\u0000' doesn't seem to be identical to '\x00'.
I managed to clean my db by replacing '\u0000' with an escaped '\\u0000'; queries run with no errors now, and context is preserved in my chats.
Edit: "\\u0000" didn't help for all cases, some being binary file data from web scraping. I still need to find and replace '\u0000' with an empty string.
@Classic298 commented on GitHub (Oct 9, 2025):
@abiari I have made a slight modification to my proposed PR
Can you test it again with a web search or a file you know for sure causes problems?
The thing is: \x00 and \u0000 are the exact same thing, just in different notations.
I hope my new code will now correctly interpret and remove all these null characters. Testing wanted!
@abiari commented on GitHub (Oct 9, 2025):
@Classic298 I tested your approach; the search and replace is actually done on the data as strings, and the replace function will look for the '\x00' string and won't match '\u0000'.
@abiari commented on GitHub (Oct 9, 2025):
I very much agree with this as well, a de/encryption of documents/images data should be preferable to preserve context.
@Classic298 commented on GitHub (Oct 9, 2025):
But how so? When is null data ever useful?
Null bytes are NEVER legitimate in text data.
They only exist in binary data (PDFs, images - which are not searched by the search function anyway and cannot be sanitized either, so that doesn't matter here).
Binary data / null data shouldn't be in text fields, or in responses (that would make them corrupted responses).
The encryption suggestion is overkill imo. This isn't about security, it's about data format.
Can you show it to me please? What did you test, or how did you test it?
The .replace function does not search for the exact literal notation in the strings as you say! It WILL match \x00 with \u0000!
@Classic298 commented on GitHub (Oct 9, 2025):
Run this code for yourself. You will see that \x00 WILL replace the \u0000 string @abiari
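The demo Classic298 refers to was not captured in this thread; a minimal reconstruction of the point, plus the serialized-JSON case that abiari is most likely hitting, could look like this:

```python
# "\x00" and "\u0000" are two spellings of the same one-character string.
assert "\x00" == "\u0000"

# So .replace() on decoded strings removes both:
s = "foo\u0000bar"
assert s.replace("\x00", "") == "foobar"

# But on *serialized* JSON text, the NUL appears as the six characters
# \u0000 (backslash, u, zeros). Replacing the real NUL character there
# does nothing - which matches abiari's observation:
serialized = '{"content": "foo\\u0000bar"}'  # contains no actual NUL byte
assert serialized.replace("\x00", "") == serialized

# To sanitize at that level you must target the escape sequence itself:
assert serialized.replace("\\u0000", "") == '{"content": "foobar"}'
```

In other words, both sides are right at different layers: \x00 and \u0000 are identical as decoded Python strings, but not inside JSON text that has already been serialized.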
@Classic298 commented on GitHub (Nov 23, 2025):
fixed in dev