[GH-ISSUE #15329] Report on Issues with UI Interaction with Ollama #9805

Open
opened 2026-04-12 22:40:47 -05:00 by GiteaMirror · 6 comments

Originally created by @DjceUo on GitHub (Apr 4, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15329

What is the issue?

Executive Summary
During testing, systematic failures were observed when using Ollama through UI clients (Chatbox, OpenWebUI).
With identical model parameters and identical prompts, some UI clients do not receive a response, despite the fact that:
• Ollama successfully performs inference
• GPU utilization reaches 100%
• CPU shows a typical compute load pattern
• responses via CLI are consistently returned without delay
This indicates a problem at the API interaction layer between UI clients and Ollama, rather than an issue with the models or hardware.
The problem is reproducible across multiple models, which rules out issues related to specific quantizations or architectures.
Additionally, behavior was observed to become unstable even with a 32k context window, whereas previous-generation models handled significantly larger context windows reliably. This may indicate issues in streaming response handling or context management.

Test Conditions
Parameters:
• identical prompt
• identical model settings
• context window = 32k
• no system configuration changes between runs
• identical hardware
• execution via:
  • CLI (ollama run)
  • Ollama API
  • Chatbox
  • OpenWebUI
Test prompt:
Explain how a quantum computer works

Observed Anomaly
In multiple cases:
• GPU reaches 100% utilization
• CPU initially shows high load, then decreases
• inference is clearly performed by Ollama
• UI does not receive token stream
• UI continues waiting until GPU utilization drops to zero
• no response is displayed
At the same time, CLI works correctly.
This is a typical symptom of one or more of the following (see the reproduction sketch after this list):
• streaming connection interruption
• chunked response processing errors
• keep-alive connection issues
• incorrect handling of SSE (server-sent events)
• client waiting indefinitely for final token
• incorrect handling of eval_duration / prompt_eval_duration
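
One way to localize the failure is to consume the token stream directly from Ollama's native API, with no UI in between. Below is a minimal sketch in Python; it assumes Ollama's default port 11434, and the model name is a placeholder to be replaced with one of the models tested above.

```python
# Minimal direct consumer of Ollama's native streaming endpoint, which emits
# newline-delimited JSON objects. Assumptions: default port 11434; the model
# name is a placeholder for one of the models under test.
import json

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3.5:27b-q4_K_M",  # placeholder model name
        "prompt": "Explain how a quantum computer works",
        "stream": True,
    },
    stream=True,
    timeout=(5, 600),  # generous read timeout so a slow first token is not mistaken for a hang
)
resp.raise_for_status()
for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)  # each non-empty line is a standalone JSON object
    print(chunk.get("response", ""), end="", flush=True)
    if chunk.get("done"):
        break
```

If tokens print incrementally here while a UI shows nothing for the same request, the failure sits between the client and Ollama's HTTP layer rather than in inference itself.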

Test Results
GLM-4.7 q6 flash

| Interface | Behavior |
| --- | --- |
| CLI | generation starts immediately |
| Ollama API | generation starts immediately |
| Chatbox | GPU 100%, no response |
| OpenWebUI | delayed start of generation |

gemma4:31b-it-q4_K_M

| Interface | Behavior |
| --- | --- |
| CLI | generation starts immediately |
| Ollama API | ~1 second delay |
| Chatbox | CPU 70% → 15-30%, no response |
| OpenWebUI | CPU 70% → 15-30%, no response |

(result consistently reproducible)

Qwen3.5-9b q8

| Interface | Behavior |
| --- | --- |
| CLI | high CPU usage, no response |
| Ollama API | generation starts immediately |
| Chatbox | ~5 second delay, high CPU usage |
| OpenWebUI | generation starts immediately |

qwen3.5:35b-a3b-q4_K_M

| Interface | Behavior |
| --- | --- |
| CLI | generation starts immediately |
| Ollama API | generation starts immediately |
| Chatbox | GPU 100%, no response |
| OpenWebUI | ~5 second delay, high CPU usage |

qwen3.5:27b-q4_K_M

| Interface | Behavior |
| --- | --- |
| CLI | generation starts immediately |
| Ollama API | generation starts immediately |
| Chatbox | no response |
| OpenWebUI | ~2 second delay |

Conclusion
Recurring issues observed:

  1. UI clients do not receive token streams despite successful inference in Ollama
  2. some clients remain waiting until inference is fully completed
  3. problem reproduces across different models
  4. problem reproduces across different quantizations
  5. CLI operates correctly
  6. Ollama API operates correctly
  7. failures occur only when using UI clients

This indicates a likely issue related to one or more of the following (the SSE path in particular can be checked with the sketch after this list):
• Ollama streaming API
• chunked transfer encoding handling
• token streaming with long context windows
• reasoning token handling
• connection timeouts
• incorrect stop sequence handling
• incorrect handling of the stream=true parameter
• differences in handling reasoning models
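
Some UI clients reach Ollama through its OpenAI-compatible endpoint (/v1/chat/completions), which streams SSE (`data:`-prefixed JSON lines) rather than the native newline-delimited JSON, so that path is worth probing separately. A sketch under the same assumptions (default port, placeholder model name):

```python
# Consume the same request via the OpenAI-compatible SSE endpoint.
# Assumptions: default port 11434; placeholder model name.
import json

import requests

resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "qwen3.5:27b-q4_K_M",  # placeholder model name
        "messages": [{"role": "user", "content": "Explain how a quantum computer works"}],
        "stream": True,
    },
    stream=True,
    timeout=(5, 600),
)
resp.raise_for_status()
for raw in resp.iter_lines():
    if not raw:
        continue
    line = raw.decode("utf-8")
    if not line.startswith("data: "):
        continue  # skip anything that is not an SSE data field
    payload = line[len("data: "):]
    if payload.strip() == "[DONE]":
        break  # SSE terminator used by the OpenAI-compatible API
    choices = json.loads(payload).get("choices") or []
    if choices:
        delta = choices[0].get("delta", {})
        print(delta.get("content") or "", end="", flush=True)
```

A stall on this path while the native endpoint streams cleanly would point at the SSE/chunked-encoding layer rather than at the models.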

Items Recommended for Investigation
API layer
• correctness of SSE streaming implementation
• stream completion handling for long responses
• consistency between CLI and HTTP API behavior
• correctness of Content-Length / Transfer-Encoding handling (see the header check after this list)
• buffer flushing behavior
• keep-alive connection stability
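
The framing items above can be spot-checked directly: a streamed response is expected to use chunked transfer encoding and to carry no Content-Length. A sketch, same assumptions as before:

```python
# Print the response framing headers on the streaming endpoint.
# Assumptions: default port 11434; placeholder model name.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen3.5:27b-q4_K_M", "prompt": "hi", "stream": True},
    stream=True,
    timeout=(5, 600),
)
print("status:", resp.status_code)
print("Content-Type:", resp.headers.get("Content-Type"))
print("Transfer-Encoding:", resp.headers.get("Transfer-Encoding"))
print("Content-Length:", resp.headers.get("Content-Length"))
resp.close()  # abandon the stream; we only wanted the headers
```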
Client layer
• correct handling of partial tokens
• reasoning token handling
• behavior when model emits reasoning tokens before final answer
• handling of stream completion events
• response timeout handling
Parameters
• impact of context window = 32k
• impact of eval_duration
• impact of prompt_eval_duration (a timing probe for these durations is sketched after this list)
• behavior of reasoning models (Qwen3.5 family)
• KV cache size impact
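
Time-to-first-token, compared against the durations Ollama reports in its final chunk, can separate prompt evaluation delay from transport-side stalls. A sketch, same assumptions as above, with `num_ctx` set to mirror the 32k test condition:

```python
# Rough time-to-first-token probe. Assumptions: default port 11434;
# placeholder model name; num_ctx pinned to 32k as in the tests above.
import json
import time

import requests

start = time.monotonic()
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3.5:27b-q4_K_M",  # placeholder model name
        "prompt": "Explain how a quantum computer works",
        "stream": True,
        "options": {"num_ctx": 32768},
    },
    stream=True,
    timeout=(5, 600),
)
first_token_at = None
final = None
for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    if first_token_at is None and chunk.get("response"):
        first_token_at = time.monotonic() - start
    if chunk.get("done"):
        final = chunk
        break
if first_token_at is None:
    print("no tokens received")
else:
    print(f"\ntime to first token: {first_token_at:.2f}s")
if final:
    # the final chunk reports these durations in nanoseconds
    print(f"prompt_eval_duration: {final.get('prompt_eval_duration', 0) / 1e9:.2f}s")
    print(f"eval_duration: {final.get('eval_duration', 0) / 1e9:.2f}s")
```

If time-to-first-token roughly equals prompt_eval_duration, the delay is prompt processing; a much larger gap would suggest buffering or connection handling in front of it.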

Why This Matters
In its current state, Ollama used through UI clients:
• is unstable
• is unpredictable
• creates the impression that models freeze
• complicates integration into enterprise interfaces
• slows adoption of local LLM infrastructure
CLI operation remains stable, confirming that the inference pipeline itself functions correctly.

Relevant log output


OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.20.0

GiteaMirror added the bug label 2026-04-12 22:40:47 -05:00

@rick-github commented on GitHub (Apr 4, 2026):

[Server logs](https://docs.ollama.com/troubleshooting) will aid in debugging.


@mario-grgic commented on GitHub (Apr 4, 2026):

[server.log](https://github.com/user-attachments/files/26481787/server.log)

Observing the exact same issue with ollama 0.20.2, using Gemma 4:26b and Open WebUI (0.8.12). I have a large context (250,000 tokens). Every interaction starts failing on the 4th "turn", i.e. ask a question, get a response, ask a follow-up question, etc. On the 4th turn I only get "Thought for x seconds" with no response.

On occasion I get garbled output with no "Thinking" tag at all.

Does not happen in CLI.

Example conversation attached.

[Conversation 1.md](https://github.com/user-attachments/files/26481749/Conversation.1.md)


@DjceUo commented on GitHub (Apr 7, 2026):

A new version just dropped, and not a single issue got fixed. On top of that, models that used to work fine in Chatbox and OpenWebUI are now basically broken — they don't work at all anymore.


@rick-github commented on GitHub (Apr 7, 2026):

[Server logs](https://docs.ollama.com/troubleshooting) will aid in debugging.


@mario-grgic commented on GitHub (Apr 8, 2026):

I have turned on trace logging and reproduced a problem where Gemma 4 displays "Thought for 17 seconds" with no output. If you expand the "Thought" section in Open WebUI, you can see that it also contains what the model should have output to the user, nicely formatted.

Compressed server.log attached.

[server_log.zip](https://github.com/user-attachments/files/26557504/server_log.zip)


@mario-grgic commented on GitHub (Apr 8, 2026):

Here is another example of the above with debug logging only (trace log is huge).

[server.log](https://github.com/user-attachments/files/26557678/server.log)

Reference: github-starred/ollama#9805