ollama API streaming does not stream #778

Closed
opened 2025-11-11 14:31:04 -06:00 by GiteaMirror · 6 comments

Originally created by @ProjectMoon on GitHub (May 1, 2024).

Bug Report

Description

Bug Summary:
If you set the stream parameter to true on the /ollama/api/chat endpoint, the OpenWebUI server proxies the request to Ollama, but instead of streaming the response back to the client as it arrives, it buffers the entire stream and returns it as one large response (newlines included). This breaks clients that expect one JSON chunk at a time.

Steps to Reproduce:

curl -s -XPOST -H "Content-Type: application/json" -H "Authorization: Bearer sk-apikeyhere" https://example.com/ollama/api/chat -d '{"model":"llama3:latest","messages":[{"role":"system","content":"Hello there"},{"role":"user","content":"hello"}],"stream":true,"options":{"temperature":1.0}}'

Expected Behavior:
A series of JSON values streamed back, one per line.
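For reference, Ollama's /api/chat with "stream": true returns newline-delimited JSON: one object per line, ending with an object whose done field is true. The values below are illustrative, not captured output:

```json
{"model":"llama3:latest","message":{"role":"assistant","content":"Hello"},"done":false}
{"model":"llama3:latest","message":{"role":"assistant","content":"!"},"done":false}
{"model":"llama3:latest","message":{"role":"assistant","content":""},"done":true}
```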

Actual Behavior:
One single HTTP response containing all of the chunks (though each chunk is properly formatted on its own line).

Environment

  • Open WebUI Version: 0.1.122

  • Ollama (if applicable): 0.1.32

  • Operating System: Docker Container (on Gentoo Linux)

Reproduction Details

Confirmation:

  • I have read and followed all the instructions provided in the README.md.
  • I am on the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.

Logs and Screenshots

Installation Method

Docker

Additional Information

The response itself is not incorrect, but because it's not properly streamed, this will break clients (like aichat) that assume it will get the chunks streamed one by one.

I looked at the code, and it seems like the ollama proxy backend is supposed to handle streaming:

    def stream_content():
        try:
            if form_data.stream:
                yield json.dumps({"id": request_id, "done": False}) + "\n"

            for chunk in r.iter_content(chunk_size=8192):
                if request_id in REQUEST_POOL:
                    yield chunk
                else:
                    log.warning("User: canceled request")
                    break
        finally:
            if hasattr(r, "close"):
                r.close()
                if request_id in REQUEST_POOL:
                    REQUEST_POOL.remove(request_id)

    r = requests.request(
        method="POST",
        url=f"{url}/api/chat",
        data=form_data.model_dump_json(exclude_none=True).encode(),
        stream=True,
    )

    r.raise_for_status()

    return StreamingResponse(
        stream_content(),
        status_code=r.status_code,
        headers=dict(r.headers),
    )

This is from the Ollama proxy's main.py, around line 875. But the stream flag doesn't seem to be respected. Maybe the form_data value isn't being set?
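One quick way to rule out the serialization path is to check that pydantic keeps the stream field through validation and model_dump_json. This is a minimal sketch with a hypothetical ChatForm standing in for the proxy's actual form model, not the project's code:

```python
# Sketch only: ChatForm is a hypothetical stand-in for the proxy's form model.
from typing import Optional

from pydantic import BaseModel


class ChatForm(BaseModel):
    model: str
    messages: list
    stream: Optional[bool] = None


form = ChatForm.model_validate({
    "model": "llama3:latest",
    "messages": [{"role": "user", "content": "hello"}],
    "stream": True,
})

# The stream flag survives validation and serialization, so the request
# body forwarded to Ollama still asks for a streaming response.
body = form.model_dump_json(exclude_none=True)
print(body)
```

If this holds, the flag reaches the upstream request intact, which points away from form parsing and toward something downstream of the StreamingResponse.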


@ProjectMoon commented on GitHub (May 1, 2024):

Added some log statements in and around the stream_content() function. The stream value is set fine on form_data, but for whatever reason, stream_content() seems to not be called, or is not executed.


@ProjectMoon commented on GitHub (May 1, 2024):

Bit more debugging. It is actually hitting the stream clause, but for whatever reason this still results in one large response back to the client.


@cheahjs commented on GitHub (May 1, 2024):

Unable to reproduce, are you running Open WebUI behind a reverse proxy that is buffering responses?
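For anyone hitting this behind nginx rather than a tunnel, per-location buffering can be switched off. This is a generic sketch of a reverse-proxy block, not OpenWebUI's shipped configuration, and the upstream address is an assumption:

```nginx
# Generic sketch for an nginx reverse proxy in front of Open WebUI;
# the upstream address and path are illustrative.
location /ollama/ {
    proxy_pass http://127.0.0.1:8080;
    proxy_http_version 1.1;
    proxy_buffering off;        # pass chunks through as they arrive
    proxy_cache off;
    proxy_set_header Connection "";
}
```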


@ProjectMoon commented on GitHub (May 1, 2024):

That is actually a very good point. It is running through a Cloudflare Tunnel.


@ProjectMoon commented on GitHub (May 1, 2024):

One question, though: is it possible to get cloudflared not to buffer? The only thing I can find is that it streams when the response has a text/event-stream content type; otherwise it seems to buffer.


@anatoliykmetyuk commented on GitHub (May 7, 2024):

@ProjectMoon have you managed to find a workaround on how to enable streaming via Cloudflare?

Reference: github-starred/open-webui#778