[GH-ISSUE #15329] Report on Issues with UI Interaction with Ollama #9805

Open
opened 2026-04-12 22:40:47 -05:00 by GiteaMirror · 6 comments

Originally created by @DjceUo on GitHub (Apr 4, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15329

What is the issue?

Executive Summary
During testing, systematic failures were observed when using Ollama through UI clients (Chatbox, OpenWebUI).
With identical model parameters and identical prompts, some UI clients do not receive a response, despite the fact that:
• Ollama successfully performs inference
• GPU utilization reaches 100%
• CPU shows a typical compute load pattern
• responses via CLI are consistently returned without delay
This indicates a problem at the API interaction layer between UI clients and Ollama, rather than an issue with the models or hardware.
The problem is reproducible across multiple models, which rules out issues related to specific quantizations or architectures.
Additionally, behavior was observed to become unstable even with a 32k context window, whereas previous-generation models handled significantly larger context windows reliably. This may indicate issues in streaming response handling or context management.

Test Conditions
Parameters:
• identical prompt
• identical model settings
• context window = 32k
• no system configuration changes between runs
• identical hardware
• execution via:
  • CLI (ollama run)
  • Ollama API
  • Chatbox
  • OpenWebUI
Test prompt:
Explain how a quantum computer works

Observed Anomaly
In multiple cases:
• GPU reaches 100% utilization
• CPU initially shows high load, then decreases
• inference is clearly performed by Ollama
• UI does not receive token stream
• UI continues waiting until GPU utilization drops to zero
• no response is displayed
At the same time, CLI works correctly.
This is a typical symptom of one or more of the following (see the reproduction sketch after this list):
• streaming connection interruption
• chunked response processing errors
• keep-alive connection issues
• incorrect handling of SSE (server-sent events)
• client waiting indefinitely for final token
• incorrect handling of eval_duration / prompt_eval_duration
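
One way to localize the failure is to consume the token stream directly from Ollama's native API, with no UI in between. Below is a minimal sketch in Python; it assumes Ollama's default port 11434, and the model name is a placeholder to be replaced with one of the models tested above.

```python
# Minimal direct consumer of Ollama's native streaming endpoint, which emits
# newline-delimited JSON objects. Assumptions: default port 11434; the model
# name is a placeholder for one of the models under test.
import json

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3.5:27b-q4_K_M",  # placeholder model name
        "prompt": "Explain how a quantum computer works",
        "stream": True,
    },
    stream=True,
    timeout=(5, 600),  # generous read timeout so a slow first token is not mistaken for a hang
)
resp.raise_for_status()
for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)  # each non-empty line is a standalone JSON object
    print(chunk.get("response", ""), end="", flush=True)
    if chunk.get("done"):
        break
```

If tokens print incrementally here while a UI shows nothing for the same request, the failure sits between the client and Ollama's HTTP layer rather than in inference itself.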

Test Results
GLM-4.7 q6 flash

| Interface | Behavior |
| --- | --- |
| CLI | generation starts immediately |
| Ollama API | generation starts immediately |
| Chatbox | GPU 100%, no response |
| OpenWebUI | delayed start of generation |

gemma4:31b-it-q4_K_M

| Interface | Behavior |
| --- | --- |
| CLI | generation starts immediately |
| Ollama API | ~1 second delay |
| Chatbox | CPU 70% → 15-30%, no response |
| OpenWebUI | CPU 70% → 15-30%, no response |

(result consistently reproducible)

Qwen3.5-9b q8

| Interface | Behavior |
| --- | --- |
| CLI | high CPU usage, no response |
| Ollama API | generation starts immediately |
| Chatbox | ~5 second delay, high CPU usage |
| OpenWebUI | generation starts immediately |

qwen3.5:35b-a3b-q4_K_M

| Interface | Behavior |
| --- | --- |
| CLI | generation starts immediately |
| Ollama API | generation starts immediately |
| Chatbox | GPU 100%, no response |
| OpenWebUI | ~5 second delay, high CPU usage |

qwen3.5:27b-q4_K_M

| Interface | Behavior |
| --- | --- |
| CLI | generation starts immediately |
| Ollama API | generation starts immediately |
| Chatbox | no response |
| OpenWebUI | ~2 second delay |

Conclusion
Recurring issues observed:

  1. UI clients do not receive token streams despite successful inference in Ollama
  2. some clients remain waiting until inference is fully completed
  3. problem reproduces across different models
  4. problem reproduces across different quantizations
  5. CLI operates correctly
  6. Ollama API operates correctly
  7. failures occur only when using UI clients

This indicates a likely issue related to one or more of the following (the SSE path in particular can be checked with the sketch after this list):
• Ollama streaming API
• chunked transfer encoding handling
• token streaming with long context windows
• reasoning token handling
• connection timeouts
• incorrect stop sequence handling
• incorrect handling of the stream=true parameter
• differences in handling reasoning models
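
Some UI clients reach Ollama through its OpenAI-compatible endpoint (/v1/chat/completions), which streams SSE (`data:`-prefixed JSON lines) rather than the native newline-delimited JSON, so that path is worth probing separately. A sketch under the same assumptions (default port, placeholder model name):

```python
# Consume the same request via the OpenAI-compatible SSE endpoint.
# Assumptions: default port 11434; placeholder model name.
import json

import requests

resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "qwen3.5:27b-q4_K_M",  # placeholder model name
        "messages": [{"role": "user", "content": "Explain how a quantum computer works"}],
        "stream": True,
    },
    stream=True,
    timeout=(5, 600),
)
resp.raise_for_status()
for raw in resp.iter_lines():
    if not raw:
        continue
    line = raw.decode("utf-8")
    if not line.startswith("data: "):
        continue  # skip anything that is not an SSE data field
    payload = line[len("data: "):]
    if payload.strip() == "[DONE]":
        break  # SSE terminator used by the OpenAI-compatible API
    choices = json.loads(payload).get("choices") or []
    if choices:
        delta = choices[0].get("delta", {})
        print(delta.get("content") or "", end="", flush=True)
```

A stall on this path while the native endpoint streams cleanly would point at the SSE/chunked-encoding layer rather than at the models.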

Items Recommended for Investigation
API layer
• correctness of SSE streaming implementation
• stream completion handling for long responses
• consistency between CLI and HTTP API behavior
• correctness of Content-Length / Transfer-Encoding handling (see the header check after this list)
• buffer flushing behavior
• keep-alive connection stability
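
The framing items above can be spot-checked directly: a streamed response is expected to use chunked transfer encoding and to carry no Content-Length. A sketch, same assumptions as before:

```python
# Print the response framing headers on the streaming endpoint.
# Assumptions: default port 11434; placeholder model name.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen3.5:27b-q4_K_M", "prompt": "hi", "stream": True},
    stream=True,
    timeout=(5, 600),
)
print("status:", resp.status_code)
print("Content-Type:", resp.headers.get("Content-Type"))
print("Transfer-Encoding:", resp.headers.get("Transfer-Encoding"))
print("Content-Length:", resp.headers.get("Content-Length"))
resp.close()  # abandon the stream; we only wanted the headers
```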
Client layer
• correct handling of partial tokens
• reasoning token handling
• behavior when model emits reasoning tokens before final answer
• handling of stream completion events
• response timeout handling
Parameters
• impact of context window = 32k
• impact of eval_duration
• impact of prompt_eval_duration (a timing probe for these durations is sketched after this list)
• behavior of reasoning models (Qwen3.5 family)
• KV cache size impact
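
Time-to-first-token, compared against the durations Ollama reports in its final chunk, can separate prompt evaluation delay from transport-side stalls. A sketch, same assumptions as above, with `num_ctx` set to mirror the 32k test condition:

```python
# Rough time-to-first-token probe. Assumptions: default port 11434;
# placeholder model name; num_ctx pinned to 32k as in the tests above.
import json
import time

import requests

start = time.monotonic()
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3.5:27b-q4_K_M",  # placeholder model name
        "prompt": "Explain how a quantum computer works",
        "stream": True,
        "options": {"num_ctx": 32768},
    },
    stream=True,
    timeout=(5, 600),
)
first_token_at = None
final = None
for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    if first_token_at is None and chunk.get("response"):
        first_token_at = time.monotonic() - start
    if chunk.get("done"):
        final = chunk
        break
if first_token_at is None:
    print("no tokens received")
else:
    print(f"\ntime to first token: {first_token_at:.2f}s")
if final:
    # the final chunk reports these durations in nanoseconds
    print(f"prompt_eval_duration: {final.get('prompt_eval_duration', 0) / 1e9:.2f}s")
    print(f"eval_duration: {final.get('eval_duration', 0) / 1e9:.2f}s")
```

If time-to-first-token roughly equals prompt_eval_duration, the delay is prompt processing; a much larger gap would suggest buffering or connection handling in front of it.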

Why This Matters
In its current state, Ollama used through UI clients:
• is unstable
• is unpredictable
• creates the impression that models freeze
• complicates integration into enterprise interfaces
• slows adoption of local LLM infrastructure
CLI operation remains stable, confirming that the inference pipeline itself functions correctly.

Relevant log output


OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.20.0

GiteaMirror added the bug label 2026-04-12 22:40:47 -05:00

@rick-github commented on GitHub (Apr 4, 2026):

[Server logs](https://docs.ollama.com/troubleshooting) will aid in debugging.


@mario-grgic commented on GitHub (Apr 4, 2026):

[server.log](https://github.com/user-attachments/files/26481787/server.log)

Observing the exact same issue with ollama 0.20.2, using Gemma 4:26b and Open WebUI (0.8.12). I have a large context (250,000 tokens). Every interaction starts failing on the 4th "turn", i.e. ask a question, get a response, ask a follow-up question, etc. On the 4th turn I only get "Thought for x seconds" with no response.

On occasion I get garbled output with no "Thinking" tag at all.

Does not happen in CLI.

Example conversation attached.

[Conversation 1.md](https://github.com/user-attachments/files/26481749/Conversation.1.md)


@DjceUo commented on GitHub (Apr 7, 2026):

A new version just dropped, and not a single issue got fixed. On top of that, models that used to work fine in Chatbox and OpenWebUI are now basically broken — they don't work at all anymore.


@rick-github commented on GitHub (Apr 7, 2026):

[Server logs](https://docs.ollama.com/troubleshooting) will aid in debugging.


@mario-grgic commented on GitHub (Apr 8, 2026):

I have turned on trace logging and reproduced a problem where Gemma 4 displays "Thought for 17 seconds" with no output. If you expand the "Thought" section in Open WebUI, you can see that it also contains what the model should have output to the user, nicely formatted.

Compressed server.log attached.

[server_log.zip](https://github.com/user-attachments/files/26557504/server_log.zip)


@mario-grgic commented on GitHub (Apr 8, 2026):

Here is another example of the above with debug logging only (trace log is huge).

[server.log](https://github.com/user-attachments/files/26557678/server.log)

Reference: github-starred/ollama#9805