[GH-ISSUE #687] Uploading documents connects to external web services such as an AWS ELB? #50844

Closed
opened 2026-05-05 11:21:47 -05:00 by GiteaMirror · 25 comments

Originally created by @prologic on GitHub (Feb 9, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/687

Bug Report

Description

Bug Summary:

I tried to upload a document to my locally hosted instance of Ollama Web UI and, to my horror, discovered that the Docker container (running Ollama Web UI) wanted to connect to an AWS ELB?! Naturally I blocked this connection (thanks to LittleSnitch). Then it wanted to connect to another external service, something to do with packages (I didn't capture it).

Steps to Reproduce:

  • Install a filtering/logging firewall like LittleSnitch
  • Upload a document
  • Observe external connections made

Expected Behavior:

I don't know wtf this is trying to do, but I really DO NOT expect a locally hosted instance of anything to be connecting externally to some 3rd-party services (within reason of course). This is absurd.

At the very least, could someone please explain why this is happening and what this is even used for? Maybe it's legit and required for some part of the "Upload Document" user journey to work?

Actual Behavior:

I expect locally hosted software to NOT connect to external services. The whole point of using Ollama in the first place is to run local LLM models 😅

Environment

Not really relevant. But Docker container on a Mac.

PS: Your issue template is too long. Please simplify it; I don't generally have the time and patience to fill out everything asked, especially as a vision-impaired person. It also takes some of the "human"(ity) out of helping to contribute to "better" open source software.


@prologic commented on GitHub (Feb 9, 2024):

FWIW blocking the two connections didn't appear to affect the functionality of Uploading a document. I was later able to select it and use it in context with #, so I'm really confused as to why those connections are even necessary at all 🤔


@tjbck commented on GitHub (Feb 9, 2024):

Hi, thanks for reporting this issue. Could you verify that the AWS ELB connection is 100% coming from the webui side? Our backend does not contain any code that explicitly connects to an AWS ELB, so my guess is that the request is made by one of our dependency libraries. If you could narrow down which part of the code is making the connection, that would be tremendously helpful. Thanks!
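
One way to narrow down which dependency opens a connection is to record every outbound `socket.connect()` call together with the Python stack that triggered it. The following is only a minimal sketch, not something taken from the Ollama Web UI codebase: the helper name `record_outbound_connections` is hypothetical, and it assumes the library connects through Python's standard `socket` module (connections made by native code or via `connect_ex` would not be seen).

```python
import socket
import traceback
from contextlib import contextmanager


@contextmanager
def record_outbound_connections():
    """Record the destination and call stack of each socket.connect() call.

    Hypothetical debugging helper: it temporarily replaces
    socket.socket.connect and restores it when the block exits.
    """
    records = []  # list of (address, formatted stack) tuples
    original_connect = socket.socket.connect

    def logging_connect(self, address):
        # Capture who asked for the connection before handing it off.
        stack = "".join(traceback.format_stack(limit=8))
        records.append((address, stack))
        return original_connect(self, address)

    socket.socket.connect = logging_connect
    try:
        yield records
    finally:
        # Always restore the original method, even if the block raised.
        socket.socket.connect = original_connect
```

Wrapping the document-processing call in `with record_outbound_connections() as records:` and then printing each entry in `records` should show the destination address next to the frames of the library that requested it.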


@prologic commented on GitHub (Feb 10, 2024):

Yup makes sense!

I'll try to narrow this down 👌 As you said, if you're not doing this explicitly in this codebase, then I'd consider it a sneaky supply-chain type of thing 🤣


@prologic commented on GitHub (Feb 10, 2024):

So here we go:

Text version(s):

Docker wants to connect to a046be49099ce4659abbcfa853797f20-5fd7cc9498e4883e.elb.ap-southeast-1.amazonaws.com on TCP port 443 (https)

Docker wants to connect to packages.unstructured.io on TCP port 443 (https)

Screenshots:
Screenshot 2024-02-10 at 11 12 24: https://github.com/ollama-webui/ollama-webui/assets/1290234/022dca20-690f-441e-a8df-7e29d76ae68c
Screenshot 2024-02-10 at 11 12 55: https://github.com/ollama-webui/ollama-webui/assets/1290234/514b7ae1-3f69-4b79-8741-8765b6db8d57

@prologic commented on GitHub (Feb 10, 2024):

Note that this is the container itself trying to do this, so something to do with the backend.


@prologic commented on GitHub (Feb 10, 2024):

Doing a search (https://github.com/search?q=%22packages.unstructured.io%22&ref=opensearch&type=code) for the 2nd connection yields this:

https://github.com/Unstructured-IO/unstructured/blob/d11c70cf83fdb8a08fed2cf01c6c0bd114d817df/unstructured/utils.py#L287-L319

Are we using this in the backend anywhere? 🤔

@tjbck commented on GitHub (Feb 10, 2024):

Here's a list of our suspects:

langchain
langchain-community
chromadb
sentence_transformers
pypdf
docx2txt
unstructured
markdown
pypandoc
pandas
openpyxl
pyxlsb
xlrd

@prologic commented on GitHub (Feb 10, 2024):

We are:

https://github.com/ollama-webui/ollama-webui/blob/cb5520c519dde81bfe08ce358753ab7f11417f97/backend/requirements.txt#L25

Why does it need to connect to an external service? 🤔

@prologic commented on GitHub (Feb 10, 2024):

I can't figure out this random ELB though, might need some help figuring that one out. But at least we have some culprits now.... The question is, what do we do about it? Blocking both doesn't adversely affect Ollama Web UI in any way that I can tell hmmm


@prologic commented on GitHub (Feb 10, 2024):

Oh wow!

def scarf_analytics():
...

If this library is sending analytics, that's disgusting 😱


@tjbck commented on GitHub (Feb 10, 2024):

UnstructuredMarkdownLoader seems to be the culprit, investigating more.


@prologic commented on GitHub (Feb 10, 2024):

> Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
>
> https://www.unstructured.io/

I have half a mind to go yell at this company and ask them to please explain themselves 🤣 Shame on them!

@tjbck commented on GitHub (Feb 10, 2024):

Just reviewed the code, I reckon setting DO_NOT_TRACK env var to True will stop the telemetry, could you try testing it?


@prologic commented on GitHub (Feb 10, 2024):

Love it! Let's do it, happy to test the fix 👌


@prologic commented on GitHub (Feb 10, 2024):

And thank you for responding to this so quickly! When you're self hosting and insisting on doing things locally, you really don't expect your software to reach out to the internet without you knowing about it 😅


@prologic commented on GitHub (Feb 10, 2024):

Some kudos I posted (https://twtxt.net/twt/me3ac2a) for you 😅

@justinh-rahb commented on GitHub (Feb 10, 2024):

Good find guys, yeah, that's definitely not nice of them to do. Is there any disclosure from the library anywhere?


@prologic commented on GitHub (Feb 10, 2024):

> Good find guys, yeah, that's definitely not nice of them to do. Is there any disclosure from the library anywhere?

Are you suggesting we file a bug upstream too? It was a bit of a rude surprise to be honest 😅

@tjbck commented on GitHub (Feb 10, 2024):

@justinh-rahb none that I can find in their readme (https://github.com/Unstructured-IO/unstructured) :/

EDIT: they do mention at the very bottom of their readme to set the environment variable SCARF_NO_ANALYTICS=true.

@tjbck commented on GitHub (Feb 10, 2024):

Added

ENV SCARF_NO_ANALYTICS true
ENV DO_NOT_TRACK true

with #694, it should disable the telemetry. Please try it out and let me know!
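
For setups that run the backend outside the Docker image, the same opt-out could be applied from Python before the document loaders are imported. This is a hedged sketch only: the two variable names and values come from this thread, while the helper names `apply_telemetry_opt_out` and `missing_opt_outs` are hypothetical, and the thread does not establish whether the library reads the variables at import time or at call time, hence setting them as early as possible.

```python
import os

# Variable names and values mirror the Dockerfile lines quoted in this thread.
TELEMETRY_OPT_OUT = {"SCARF_NO_ANALYTICS": "true", "DO_NOT_TRACK": "true"}


def apply_telemetry_opt_out(environ=None):
    """Set the opt-out variables on the given mapping (os.environ by default)."""
    env = os.environ if environ is None else environ
    for name, value in TELEMETRY_OPT_OUT.items():
        env[name] = value
    return env


def missing_opt_outs(environ=None):
    """Return the opt-out variables that are unset or not set to 'true'."""
    env = os.environ if environ is None else environ
    return [
        name
        for name, value in TELEMETRY_OPT_OUT.items()
        if env.get(name, "").lower() != value
    ]
```

Calling `apply_telemetry_opt_out()` at the very top of the entry point, and checking that `missing_opt_outs()` is empty at startup, gives a cheap guard against the variables being dropped from a deployment.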


@justinh-rahb commented on GitHub (Feb 10, 2024):

With RAG being as hot as it is right now, I guess we shouldn't be surprised that some library authors are cashing in on the user data flowing through their code.

Perhaps it'll be prudent to think about dependency audits in the future. With Ollama now supporting a broad range of CPU-only configurations, it can be integrated into GitHub Actions, along with Ollama-WebUI for thorough end-to-end testing. I'm going to give this a think over the weekend, I seem to recall there being a thread in discussions about using the webUI API directly that may come in handy here, time to do some research...
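
A very rough first pass at such a dependency audit might simply list the hostnames hard-coded in installed package sources and flag anything not on an allowlist. The sketch below is an assumption-laden illustration, not an existing tool: `hosts_in_text` and `audit_tree` are hypothetical names, and the approach only catches literal `http(s)://` URLs, so hosts assembled at runtime would be missed and egress monitoring (as done with LittleSnitch above) remains the more reliable check.

```python
import re
from pathlib import Path

# Matches the host part of a literal http:// or https:// URL.
URL_HOST = re.compile(r"https?://([A-Za-z0-9.-]+)")


def hosts_in_text(text):
    """Return the set of hostnames that appear in literal URLs in the text."""
    return {m.group(1).lower().rstrip(".") for m in URL_HOST.finditer(text)}


def audit_tree(root, allowed_hosts=frozenset()):
    """Map each .py file under root to its hostnames not in allowed_hosts."""
    findings = {}
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        unexpected = hosts_in_text(text) - set(allowed_hosts)
        if unexpected:
            findings[str(path)] = sorted(unexpected)
    return findings
```

Pointing `audit_tree` at a virtual environment's `site-packages` directory would produce a list to review by hand; run in CI, a diff of that list between releases could highlight newly introduced endpoints.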


@tjbck commented on GitHub (Feb 13, 2024):

@prologic has the issue been resolved with the latest release?


@prologic commented on GitHub (Feb 13, 2024):

I pulled the latest Docker image and restarted my local instance and so far so good 😊


@tjbck commented on GitHub (Feb 14, 2024):

I'll close this issue for now. Feel free to open new issues if you encounter any spyware in the dependency supply chain, thanks!


@aswani-ms commented on GitHub (Jun 21, 2024):

Do you have example code showing how to upload a document programmatically through an API? Is it possible?

Reference: github-starred/open-webui#50844