[GH-ISSUE #14670] issue: Azure Document Intelligence can crash Open WebUI #32859
Originally created by @jimbo-p on GitHub (Jun 4, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/14670
Originally assigned to: @tjbck on GitHub.
Check Existing Issues
Installation Method
Docker
Open WebUI Version
0.6.13
Ollama Version (if applicable)
No response
Operating System
Windows 11
Browser (if applicable)
Firefox
Confirmation
Expected Behavior
I drag and drop a larger PDF into Open WebUI, where "larger" means a document that may take more than one minute to OCR (i.e. > 5-10 MB). Open WebUI uses Azure Document Intelligence to OCR that PDF in preparation for the RAG workflow.
Actual Behavior
I drag and drop a larger PDF into Open WebUI. It uses Azure Document Intelligence to OCR and crashes Open WebUI entirely if it takes > 45 seconds to complete the OCR.
After ~45 seconds, the document disappears out of the chatbox and then the Open WebUI interface becomes unresponsive.
Steps to Reproduce
Logs & Screenshots
Cloudwatch Logs
2025-06-04T18:30:55.538Z
INFO | Request URL: https://opstech-form-recognizer.cognitiveservices.azure.com/documentintelligence/documentModels/prebuilt-layout:analyze?api-version=REDACTED&outputContentFormat=REDACTED
Request method: POST
Request headers:
content-type: application/octet-stream
Accept: application/json
x-ms-client-request-id: 0d667c42-4172-11f0-bfc1-0a58a9feac02
x-ms-useragent: REDACTED
User-Agent: azsdk-python-ai-documentintelligence/1.0.0 Python/3.11.12 (Linux-5.10.235-227.919.amzn2.x86_64-x86_64-with-glibc2.36)
Ocp-Apim-Subscription-Key: REDACTED
A body is sent with the request.
DEBUG | Starting new HTTPS connection: opstech-form-recognizer.cognitiveservices.azure.com:443
2025-06-04T18:30:57.757Z
DEBUG | HTTPS POST to /documentintelligence/documentModels/prebuilt-layout:analyze?...
Response: 202
2025-06-04T18:30:57.763Z
INFO | Response status: 202
Response headers:
Date: Wed, 04 Jun 2025 18:30:57 GMT
Content-Length: 0
Connection: keep-alive
Operation-Location: REDACTED
x-envoy-upstream-service-time: REDACTED
apim-request-id: REDACTED
Strict-Transport-Security: REDACTED
x-content-type-options: REDACTED
x-ms-region: REDACTED
2025-06-04T18:30:57.764Z
INFO | Request URL: https://opstech-form-recognizer.cognitiveservices.azure.com/documentintelligence/documentModels/prebuilt-layout/analyzeResults/b1418de2-256b-445e-8876-6a963abdfb0f?api-version=REDACTED
Request method: GET
Request headers:
x-ms-client-request-id: 0d667c42-4172-11f0-bfc1-0a58a9feac02
x-ms-useragent: REDACTED
User-Agent: azsdk-python-ai-documentintelligence/1.0.0 Python/3.11.12 (Linux-5.10.235-227.919.amzn2.x86_64-x86_64-with-glibc2.36)
Ocp-Apim-Subscription-Key: REDACTED
No body was attached to the request - {}
....
2025-06-04T18:31:39.116Z
Ocp-Apim-Subscription-Key: REDACTED
No body was attached to the request - {}
2025-06-04T18:31:41.495Z
DEBUG | HTTPS GET to /documentintelligence/documentModels/prebuilt-layout/analyzeResults/b1418de2-256b-445e-8876-6a963abdfb0f?...
Response: 200
2025-06-04T18:31:46.242Z
INFO | Response status: 200
Response headers:
Date: Wed, 04 Jun 2025 18:31:41 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 57325967
Connection: keep-alive
x-envoy-upstream-service-time: REDACTED
apim-request-id: REDACTED
Strict-Transport-Security: REDACTED
x-content-type-options: REDACTED
x-ms-region: REDACTED
Browser Logs:
drop { target: p.is-empty.is-editor-empty, buttons: 0, clientX: 678, clientY: 229, layerX: 156, layerY: 18 }
CBWlh6ps.js:15:56554
Array [ File ]
CBWlh6ps.js:15:56687
Input files handler called with:
Array [ File ]
CBWlh6ps.js:15:55174
Processing file:
Object { name: "All Fluid Levels in Sample.pdf", type: "application/pdf", size: 24858908, extension: "pdf" }
CBWlh6ps.js:15:55258
Object { type: "file", file: "", id: null, url: "", name: "All Fluid Levels in Sample.pdf", collection_name: "", status: "uploading", size: 24858908, error: "", itemId: "30b37fea-08a9-456a-9d3c-98947fb6568a" }
DlDIEvos.js:7:7076
XHRPOST
http://internal-oxygpt-alb-dev-1998233518.us-east-1.elb.amazonaws.com/api/v1/files/
[HTTP/1.1 504 Gateway Time-out 69605ms]
SyntaxError: JSON.parse: unexpected character at line 1 column 1 of the JSON data Bt_AvK56.js:1:365
a index.ts:26
(Async: promise callback)
s index.ts:24
Et MessageInput.svelte:265
on MessageInput.svelte:352
on MessageInput.svelte:298
mn MessageInput.svelte:380
Additional Information
The logs are very verbose, so I didn't include everything, but Open WebUI continuously makes calls to Azure Document Intelligence (as it should) until the document is ready. The failure is sudden: my logs show no error message, but the Open WebUI app crashes and has to be restarted.
On testing, it appears to happen after ~50-60 seconds of not receiving an OCR result.
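For context, a minimal sketch of the submit-and-poll flow visible in the logs above (POST returns 202 with an Operation-Location header, which is then polled). This is not Open WebUI's actual code; the endpoint placeholders, the 600-second deadline, and the helper name are assumptions for illustration only:

```python
import time
import requests

# Placeholders: fill in your own resource endpoint, key, and API version.
ENDPOINT = "https://<resource>.cognitiveservices.azure.com"
API_KEY = "<key>"
API_VERSION = "<api-version>"


def ocr_pdf(pdf_bytes: bytes, deadline_s: float = 600.0) -> dict:
    """Submit a PDF to the prebuilt-layout model and poll until the
    analysis finishes or the overall deadline is exceeded."""
    url = (
        f"{ENDPOINT}/documentintelligence/documentModels/"
        f"prebuilt-layout:analyze?api-version={API_VERSION}"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": API_KEY,
        "Content-Type": "application/octet-stream",
    }
    # POST returns 202 Accepted plus an Operation-Location header to poll.
    submit = requests.post(url, headers=headers, data=pdf_bytes, timeout=30)
    submit.raise_for_status()
    operation_url = submit.headers["Operation-Location"]

    start = time.monotonic()
    while True:
        poll = requests.get(
            operation_url,
            headers={"Ocp-Apim-Subscription-Key": API_KEY},
            timeout=30,
        )
        poll.raise_for_status()
        body = poll.json()
        if body.get("status") == "succeeded":
            return body["analyzeResult"]
        if body.get("status") == "failed":
            raise RuntimeError(f"analysis failed: {body}")
        if time.monotonic() - start > deadline_s:
            raise TimeoutError("gave up waiting for the OCR result")
        time.sleep(2)
```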
@decent-engineer-decent-datascientist commented on GitHub (Jun 4, 2025):
I’m able to recreate this on our setup.
@pierrelouisbescond commented on GitHub (Jun 5, 2025):
I confirm that I've been able to reproduce this bug using Azure Document Intelligence and a 6.3 MB Arxiv PDF document (https://arxiv.org/pdf/2505.24876).
The uploaded document simply disappears from the chat UI.
@iamcristi commented on GitHub (Jun 5, 2025):
I've also reproduced this. I've noticed CPU goes to 100% and RAM usage grows until OOM while stuck at
53764fe648/backend/open_webui/routers/retrieval.py (L1125)
@jackthgu commented on GitHub (Jun 13, 2025):
Hello, are you currently using the paid version of Azure Document Intelligence?
@pierrelouisbescond commented on GitHub (Jun 13, 2025):
Yes, we have an Azure company subscription.
@jackthgu commented on GitHub (Jun 15, 2025):
It looks like the request dies right at most reverse-proxy time-outs. Could you tell me:
- your proxy timeout settings (proxy_read_timeout, proxy_send_timeout)
- whether the proxy logs show a 504 or "upstream timed out"
Then raise those timeouts, reload Nginx, and rerun the upload.
If it works, we can fine-tune or document the fix. Let me know!
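For reference, a minimal Nginx sketch of the kind of timeout change being discussed here; the values and the upstream address are examples only, not recommendations, so tune them to how long your OCR actually takes:

```nginx
# Inside your existing server block; illustrative values only.
location / {
    proxy_pass http://open-webui:8080;
    proxy_read_timeout 15m;   # wait this long for a response from Open WebUI
    proxy_send_timeout 15m;   # wait this long while sending the upload body
    client_max_body_size 50M; # allow large PDF uploads through the proxy
}
```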
@jimbo-p commented on GitHub (Jun 16, 2025):
I went ahead and updated my timeout on the ALB to 15 minutes. Open WebUI is no longer crashing and the document isn't disappearing from the chatbox. However, it spins forever: after the previous default timeout (~60 seconds), it looks like Open WebUI gives up on making requests to the Document Intelligence endpoint to check whether OCR has completed.
@tjbck commented on GitHub (Jun 20, 2025):
Potentially related to #15023
@jackthgu commented on GitHub (Jun 20, 2025):
As I'm currently using the free tier of Document Intelligence, I'm unable to fully replicate the issue on my end. Could you kindly share the full logs from your side for that specific scenario? It would be very helpful.
@zolgear commented on GitHub (Jun 23, 2025):
Error situation:
PDF file: 19MB, 484 pages
Azure Document Intelligence API call completes successfully and is confirmed in the logs:
After that, it seems that split_documents is running until an OOM (Out of Memory) error occurs.
Tested on a machine with 32 vCPUs and 128 GB RAM.
Even when CPU usage reached 100% and RAM usage exceeded 100 GB, the embedding process did not start.
@fmonnier74 commented on GitHub (Jul 21, 2025):
Was able to reproduce, even without Document Intelligence.
To reproduce:
If Open WebUI is behind a reverse proxy: upload a large full-text file that will take more time than the proxy timeout. This is when the file disappears from the interface.
If not behind a reverse proxy: upload a large full-text file and kill your browser while the upload is still in progress.
Result:
Open WebUI will start to allocate memory until OOM no matter what. Document Intelligence is just a catalyst to reach this timeout faster since it extracts more data from files.
I have 200 users on my deployment and I am struggling with this issue since many users upload any file they want regardless of the current limitation.
Workaround:
Take a worst-case scenario, for example a full-text file of about 20 MB, measure the time your setup takes to perform the vectorization, set your proxy timeout to the measured time plus some margin, and set the max file size to 20 MB. That is how I approached the issue.
Hope this can help.
@Maximilian-Pichler commented on GitHub (Aug 18, 2025):
I think we're running into two separate issues here: first, it seems like there's a proxy timeout that could be affecting large uploads; second, the way embeddings and document uploads are handled appears to be very resource-intensive.
Our setup includes Open WebUI (hosted as an Azure Web App with 32 GB RAM), Document Intelligence, and Postgres with pgvector. Whenever I try uploading a large PDF (for example, a 1000-page, 33 MB document), the app runs out of memory, crashes, and restarts. Any advice on how to tackle these problems or optimize our resource usage would be super helpful!
@zolgear commented on GitHub (Sep 1, 2025):
With v0.6.26, PDF transcription in Notes works.
Embedding in the backend causes memory to increase until OOM.
Restarting the container before the server crashes allows work to continue.
Enabling "Bypass Embedding and Retrieval" avoids the problem for now.
@Maximilian-Pichler commented on GitHub (Sep 2, 2025):
The issue still persists with v0.6.26.
@tjbck commented on GitHub (Sep 11, 2025):
We may have addressed this in dev! Testing wanted here! @Maximilian-Pichler @zolgear
@zolgear commented on GitHub (Sep 12, 2025):
I tested using the ghcr.io/open-webui/open-webui:dev-cuda image.
Test case: Knowledge Base
PDF file size: 9 MB, 132 pages.
The issue of memory usage increasing indefinitely has been improved.
However, when the page count is large, Embedding can consume 20 GB–30 GB of memory, which is still a challenge for practical use — but this is a step forward!
@asabla commented on GitHub (Sep 23, 2025):
We've had the exact same issue mentioned previously in this thread. It also seems that, depending on which tier of Document Intelligence you're using, you get more data out of OCR in the responses.
The main issue seems to be caused by some of the fields stored in embedding_metadata, which grows the vector database at an insane rate. On top of that, there also seems to be an issue with self-referencing JSON (which might cause some of the out-of-memory issues we're seeing).
Initial testing with removing some of the fields during stringification of the metadata in utils.py, found at ./backend/open_webui/retrieval/vector/utils.py, seems to be enough. So far we've removed the following keys in that function:
I do realize that some of these fields are very useful during the RAG pipeline, but I do not believe the whole complicated JSON structure returned by Azure Document Intelligence is necessary in order to do so.
Suggestion:
Remove some of the returned fields (like content) and simplify the JSON structure. Another alternative is to ignore some or all of these fields for now.
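For anyone wanting to experiment locally, here is a minimal sketch of the kind of key filtering described above. The function name, the key list, and the depth limit are illustrative assumptions, not the actual utils.py implementation or the exact set of keys asabla removed:

```python
import json

# Illustrative: keys assumed to carry the bulky Azure Document Intelligence
# payload; the real list of keys to drop may differ.
DROP_KEYS = {"content", "words", "lines", "spans", "polygon"}
MAX_DEPTH = 10  # guard against deeply nested structures


def slim_metadata(value, _depth=0, _seen=None):
    """Recursively drop heavyweight keys and break reference cycles
    before the metadata is stringified into the vector store."""
    if _seen is None:
        _seen = set()
    if _depth > MAX_DEPTH:
        return None
    if isinstance(value, dict):
        if id(value) in _seen:  # self-referencing structure, stop here
            return None
        _seen.add(id(value))
        return {
            k: slim_metadata(v, _depth + 1, _seen)
            for k, v in value.items()
            if k not in DROP_KEYS
        }
    if isinstance(value, list):
        if id(value) in _seen:
            return None
        _seen.add(id(value))
        return [slim_metadata(v, _depth + 1, _seen) for v in value]
    return value


# Example: only the slimmed-down metadata gets serialized and stored.
slim = json.dumps(slim_metadata({"page": 3, "words": ["..."] * 10000}))
```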
@decent-engineer-decent-datascientist commented on GitHub (Sep 29, 2025):
@asabla Any chance you've verified the extracted tables exist in the text representation as well? It'd be a shame if we were to lose the tables altogether.
@asabla commented on GitHub (Oct 2, 2025):
@decent-engineer-decent-datascientist the tables are somewhat represented, depending on what the documents look like. Like I mentioned before, it would probably be enough to reduce the default amount of allowed JSON complexity and fix the self-referencing issue.
@tjbck commented on GitHub (Oct 2, 2025):
@asabla that metadata should've been removed in the latest release. Excel files still seem to take forever to retrieve the parsed content, so we've updated our internal content extraction logic to use the built-in method instead, FYI.
@asabla commented on GitHub (Nov 10, 2025):
Alright @tjbck, I've been testing with different vector stores, and so far the changes seem to have solved it. Unless anyone else has or finds any further issues, I would consider this solved.
@Maximilian-Pichler commented on GitHub (Nov 10, 2025):
can confirm
@Ruben-Wien commented on GitHub (Nov 12, 2025):
Is there a PR associated with this fix? Can't find it.