[GH-ISSUE #14681] issue: Current model set as Task model is reloaded to generate title with different keep alive parameter #17332

Open
opened 2026-04-19 23:04:11 -05:00 by GiteaMirror · 5 comments
Owner

Originally created by @trinhkvo on GitHub (Jun 5, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/14681

Check Existing Issues

  • I have searched the existing issues and discussions.
  • I am using the latest version of Open WebUI.

Installation Method

Docker

Open WebUI Version

v0.6.13

Ollama Version (if applicable)

v0.6.8

Operating System

Windows 11

Browser (if applicable)

No response

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided every relevant configuration, setting, and environment variable used in my setup.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
  • Start with the initial platform/version/OS and dependencies used,
  • Specify exact install/launch/configure commands,
  • List URLs visited, user input (incl. example values/emails/passwords if needed),
  • Describe all options and toggles enabled or changed,
  • Include any files or environmental changes,
  • Identify the expected and actual result at each stage,
  • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

When the current model is also set as the Task model, the title should be generated immediately after the first response in a new chat, without Ollama having to reload the model.

Actual Behavior

After the first response in a new chat, the current model is always reloaded to generate the title.

Steps to Reproduce

  • Set the keep alive parameter in OWUI to -1m
  • Run the Ollama Docker container with the default keep alive parameter (5m)
  • Open a new chat in OWUI and submit a query
  • Once the response is streaming (i.e., the model is loaded), "ollama ps" shows that the model is kept loaded forever (as expected from the OWUI setting)
  • After the response completes, check the Ollama log and see that the model is being reloaded for title generation
  • After the title is generated, "ollama ps" shows that the model is kept alive for 5 minutes (i.e., the Ollama default)
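For reference, the keep alive value configured in OWUI travels to Ollama as the keep_alive field of the /api/chat request body. A minimal sketch of such a payload (the model name and message are placeholders; the relevant part is keep_alive, where a negative duration such as "-1m" tells Ollama to keep the model loaded indefinitely):

```python
import json

# Hypothetical /api/chat payload; "gemma3:27b-it-qat" and the message are
# placeholders. keep_alive is the field OWUI sets to "-1m" in the setup
# above, while Ollama's own server-side default is "5m".
payload = {
    "model": "gemma3:27b-it-qat",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
    "keep_alive": "-1m",  # negative duration = keep the model loaded forever
}
body = json.dumps(payload)
print(body)
```

If a follow-up request (such as the title generation task) arrives with a different keep_alive value, Ollama applies that new value, which matches the observed fallback to the 5-minute default after the title is generated.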

Logs & Screenshots

2025-06-04 23:36:02.164 | 2025/06/05 03:36:02 routes.go:1233: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:15m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
2025-06-04 23:36:02.347 | time=2025-06-05T03:36:02.347Z level=INFO source=images.go:463 msg="total blobs: 41"
2025-06-04 23:36:02.455 | time=2025-06-05T03:36:02.454Z level=INFO source=images.go:470 msg="total unused blobs removed: 0"
2025-06-04 23:36:02.562 | time=2025-06-05T03:36:02.562Z level=INFO source=routes.go:1300 msg="Listening on [::]:11434 (version 0.6.8)"
2025-06-04 23:36:02.564 | time=2025-06-05T03:36:02.564Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
2025-06-04 23:36:02.843 | time=2025-06-05T03:36:02.842Z level=INFO source=types.go:130 msg="inference compute" id=GPU-fb65d1cd-e129-e457-f1f2-c11081edf878 library=cuda variant=v12 compute=8.6 driver=12.9 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="11.0 GiB"
2025-06-04 23:36:02.843 | time=2025-06-05T03:36:02.842Z level=INFO source=types.go:130 msg="inference compute" id=GPU-9d96376a-8917-7f4e-e3c4-03408ac757ee library=cuda variant=v12 compute=8.6 driver=12.9 name="NVIDIA GeForce RTX 3060" total="12.0 GiB" available="11.0 GiB"
2025-06-04 23:36:28.057 | [GIN] 2025/06/05 - 03:36:28 | 200 | 144.888519ms | 172.18.0.5 | GET "/api/tags"
2025-06-04 23:36:28.265 | [GIN] 2025/06/05 - 03:36:28 | 200 | 637.548µs | 172.18.0.5 | GET "/api/ps"
2025-06-04 23:36:29.334 | [GIN] 2025/06/05 - 03:36:29 | 200 | 28.571µs | 172.18.0.5 | GET "/api/version"
2025-06-04 23:36:32.330 | [GIN] 2025/06/05 - 03:36:32 | 200 | 37.841µs | 172.18.0.5 | GET "/api/version"
2025-06-04 23:36:33.700 | [GIN] 2025/06/05 - 03:36:33 | 200 | 120.365057ms | 172.18.0.5 | GET "/api/tags"
2025-06-04 23:36:33.702 | [GIN] 2025/06/05 - 03:36:33 | 200 | 22.01µs | 172.18.0.5 | GET "/api/ps"
2025-06-04 23:36:33.992 | [GIN] 2025/06/05 - 03:36:33 | 200 | 125.537912ms | 172.18.0.5 | GET "/api/tags"
2025-06-04 23:36:33.994 | [GIN] 2025/06/05 - 03:36:33 | 200 | 21.15µs | 172.18.0.5 | GET "/api/ps"
2025-06-04 23:36:42.892 | time=2025-06-05T03:36:42.892Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:36:43.188 | time=2025-06-05T03:36:43.188Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:36:43.244 | time=2025-06-05T03:36:43.244Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:36:43.246 | time=2025-06-05T03:36:43.246Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:36:43.470 | time=2025-06-05T03:36:43.470Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:36:43.703 | time=2025-06-05T03:36:43.703Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:36:43.925 | time=2025-06-05T03:36:43.925Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:36:44.146 | time=2025-06-05T03:36:44.146Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:36:44.364 | time=2025-06-05T03:36:44.364Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:36:44.803 | time=2025-06-05T03:36:44.803Z level=INFO source=server.go:106 msg="system memory" total="47.0 GiB" free="39.8 GiB" free_swap="12.0 GiB"
2025-06-04 23:36:44.805 | time=2025-06-05T03:36:44.805Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:36:45.028 | time=2025-06-05T03:36:45.028Z level=INFO source=server.go:139 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=62 layers.split=32,30 memory.available="[11.0 GiB 11.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="22.7 GiB" memory.required.partial="21.6 GiB" memory.required.kv="1.6 GiB" memory.required.allocations="[10.8 GiB 10.8 GiB]" memory.weights.total="14.5 GiB" memory.weights.repeating="13.5 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="2.2 GiB" memory.graph.partial="2.2 GiB" projector.weights="818.0 MiB" projector.graph="0 B"
2025-06-04 23:36:45.028 | time=2025-06-05T03:36:45.028Z level=INFO source=server.go:186 msg="enabling flash attention"
2025-06-04 23:36:45.097 | time=2025-06-05T03:36:45.097Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:36:45.098 | time=2025-06-05T03:36:45.098Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
2025-06-04 23:36:45.102 | time=2025-06-05T03:36:45.102Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.image_size default=0
2025-06-04 23:36:45.102 | time=2025-06-05T03:36:45.102Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.patch_size default=0
2025-06-04 23:36:45.102 | time=2025-06-05T03:36:45.102Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.num_channels default=0
2025-06-04 23:36:45.102 | time=2025-06-05T03:36:45.102Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.block_count default=0
2025-06-04 23:36:45.102 | time=2025-06-05T03:36:45.102Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.embedding_length default=0
2025-06-04 23:36:45.102 | time=2025-06-05T03:36:45.102Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.attention.head_count default=0
2025-06-04 23:36:45.102 | time=2025-06-05T03:36:45.102Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.image_size default=0
2025-06-04 23:36:45.102 | time=2025-06-05T03:36:45.102Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.patch_size default=0
2025-06-04 23:36:45.102 | time=2025-06-05T03:36:45.102Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.attention.layer_norm_epsilon default=0
2025-06-04 23:36:45.106 | time=2025-06-05T03:36:45.106Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
2025-06-04 23:36:45.106 | time=2025-06-05T03:36:45.106Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
2025-06-04 23:36:45.106 | time=2025-06-05T03:36:45.106Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
2025-06-04 23:36:45.106 | time=2025-06-05T03:36:45.106Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
2025-06-04 23:36:45.108 | time=2025-06-05T03:36:45.108Z level=INFO source=server.go:410 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-4f1e32db877a9339df2d6529c1635570425cbe81f0aa3f7dd5d1452f2e632b42 --ctx-size 32768 --batch-size 512 --n-gpu-layers 62 --threads 8 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --tensor-split 32,30 --port 34837"
2025-06-04 23:36:45.109 | time=2025-06-05T03:36:45.108Z level=INFO source=sched.go:452 msg="loaded runners" count=1
2025-06-04 23:36:45.109 | time=2025-06-05T03:36:45.109Z level=INFO source=server.go:589 msg="waiting for llama runner to start responding"
2025-06-04 23:36:45.110 | time=2025-06-05T03:36:45.110Z level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server not responding"
2025-06-04 23:36:45.120 | time=2025-06-05T03:36:45.120Z level=INFO source=runner.go:851 msg="starting ollama engine"
2025-06-04 23:36:45.131 | time=2025-06-05T03:36:45.131Z level=INFO source=runner.go:914 msg="Server listening on 127.0.0.1:34837"
2025-06-04 23:36:45.201 | time=2025-06-05T03:36:45.201Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:36:45.202 | time=2025-06-05T03:36:45.202Z level=WARN source=ggml.go:152 msg="key not found" key=general.description default=""
2025-06-04 23:36:45.202 | time=2025-06-05T03:36:45.202Z level=INFO source=ggml.go:72 msg="" architecture=gemma3 file_type=Q4_0 name="Gemma 3 27b It Qat" description="" num_tensors=808 num_key_values=45
2025-06-04 23:36:45.224 | load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
2025-06-04 23:36:45.362 | time=2025-06-05T03:36:45.362Z level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server loading model"
2025-06-04 23:36:45.573 | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2025-06-04 23:36:45.573 | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
2025-06-04 23:36:45.573 | ggml_cuda_init: found 2 CUDA devices:
2025-06-04 23:36:45.573 | Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
2025-06-04 23:36:45.573 | Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
2025-06-04 23:36:45.689 | load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
2025-06-04 23:36:45.689 | time=2025-06-05T03:36:45.689Z level=INFO source=ggml.go:103 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
2025-06-04 23:36:45.896 | time=2025-06-05T03:36:45.896Z level=INFO source=ggml.go:298 msg="model weights" buffer=CPU size="2.2 GiB"
2025-06-04 23:36:45.896 | time=2025-06-05T03:36:45.896Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA0 size="7.0 GiB"
2025-06-04 23:36:45.896 | time=2025-06-05T03:36:45.896Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA1 size="6.5 GiB"
2025-06-04 23:37:02.601 | [GIN] 2025/06/05 - 03:37:02 | 200 | 1.585813342s | 172.18.0.5 | GET "/api/tags"
2025-06-04 23:37:02.603 | [GIN] 2025/06/05 - 03:37:02 | 200 | 47.182µs | 172.18.0.5 | GET "/api/ps"
2025-06-04 23:37:04.581 | [GIN] 2025/06/05 - 03:37:04 | 200 | 1.475773577s | 172.18.0.5 | GET "/api/tags"
2025-06-04 23:37:04.583 | [GIN] 2025/06/05 - 03:37:04 | 200 | 41.021µs | 172.18.0.5 | GET "/api/ps"
2025-06-04 23:37:23.093 | time=2025-06-05T03:37:23.092Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
2025-06-04 23:37:23.097 | time=2025-06-05T03:37:23.096Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.image_size default=0
2025-06-04 23:37:23.097 | time=2025-06-05T03:37:23.096Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.patch_size default=0
2025-06-04 23:37:23.097 | time=2025-06-05T03:37:23.096Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.num_channels default=0
2025-06-04 23:37:23.097 | time=2025-06-05T03:37:23.096Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.block_count default=0
2025-06-04 23:37:23.097 | time=2025-06-05T03:37:23.096Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.embedding_length default=0
2025-06-04 23:37:23.097 | time=2025-06-05T03:37:23.096Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.attention.head_count default=0
2025-06-04 23:37:23.097 | time=2025-06-05T03:37:23.096Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.image_size default=0
2025-06-04 23:37:23.097 | time=2025-06-05T03:37:23.096Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.patch_size default=0
2025-06-04 23:37:23.097 | time=2025-06-05T03:37:23.096Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.attention.layer_norm_epsilon default=0
2025-06-04 23:37:23.101 | time=2025-06-05T03:37:23.100Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
2025-06-04 23:37:23.101 | time=2025-06-05T03:37:23.100Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
2025-06-04 23:37:23.101 | time=2025-06-05T03:37:23.100Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
2025-06-04 23:37:23.101 | time=2025-06-05T03:37:23.100Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
2025-06-04 23:37:23.558 | time=2025-06-05T03:37:23.558Z level=INFO source=ggml.go:553 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="154.5 MiB"
2025-06-04 23:37:23.558 | time=2025-06-05T03:37:23.558Z level=INFO source=ggml.go:553 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="138.5 MiB"
2025-06-04 23:37:23.558 | time=2025-06-05T03:37:23.558Z level=INFO source=ggml.go:553 msg="compute graph" backend=CPU buffer_type=CPU size="10.5 MiB"
2025-06-04 23:37:23.761 | time=2025-06-05T03:37:23.761Z level=INFO source=server.go:628 msg="llama runner started in 38.65 seconds"
2025-06-04 23:38:08.008 | [GIN] 2025/06/05 - 03:38:08 | 200 | 1m25s | 172.18.0.5 | POST "/api/chat"
2025-06-04 23:38:08.220 | time=2025-06-05T03:38:08.219Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:38:13.258 | time=2025-06-05T03:38:13.258Z level=WARN source=sched.go:655 msg="gpu VRAM usage didn't recover within timeout" seconds=5.036824613 model=/root/.ollama/models/blobs/sha256-4f1e32db877a9339df2d6529c1635570425cbe81f0aa3f7dd5d1452f2e632b42
2025-06-04 23:38:13.534 | time=2025-06-05T03:38:13.534Z level=WARN source=sched.go:655 msg="gpu VRAM usage didn't recover within timeout" seconds=5.313355817 model=/root/.ollama/models/blobs/sha256-4f1e32db877a9339df2d6529c1635570425cbe81f0aa3f7dd5d1452f2e632b42
2025-06-04 23:38:13.761 | time=2025-06-05T03:38:13.761Z level=WARN source=sched.go:655 msg="gpu VRAM usage didn't recover within timeout" seconds=5.540399162 model=/root/.ollama/models/blobs/sha256-4f1e32db877a9339df2d6529c1635570425cbe81f0aa3f7dd5d1452f2e632b42
2025-06-04 23:38:14.019 | time=2025-06-05T03:38:14.019Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:38:14.076 | time=2025-06-05T03:38:14.076Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:38:14.079 | time=2025-06-05T03:38:14.079Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:38:14.301 | time=2025-06-05T03:38:14.301Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:38:14.523 | time=2025-06-05T03:38:14.523Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:38:14.763 | time=2025-06-05T03:38:14.763Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:38:14.979 | time=2025-06-05T03:38:14.979Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:38:15.202 | time=2025-06-05T03:38:15.202Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:38:15.635 | time=2025-06-05T03:38:15.635Z level=INFO source=server.go:106 msg="system memory" total="47.0 GiB" free="39.7 GiB" free_swap="12.0 GiB"
2025-06-04 23:38:15.637 | time=2025-06-05T03:38:15.637Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:38:15.853 | time=2025-06-05T03:38:15.853Z level=INFO source=server.go:139 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=62 layers.split=32,30 memory.available="[11.0 GiB 11.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="22.7 GiB" memory.required.partial="21.6 GiB" memory.required.kv="1.6 GiB" memory.required.allocations="[10.8 GiB 10.8 GiB]" memory.weights.total="14.5 GiB" memory.weights.repeating="13.5 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="2.2 GiB" memory.graph.partial="2.2 GiB" projector.weights="818.0 MiB" projector.graph="0 B"
2025-06-04 23:38:15.853 | time=2025-06-05T03:38:15.853Z level=INFO source=server.go:186 msg="enabling flash attention"
2025-06-04 23:38:15.918 | time=2025-06-05T03:38:15.918Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:38:15.919 | time=2025-06-05T03:38:15.919Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
2025-06-04 23:38:15.923 | time=2025-06-05T03:38:15.923Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.image_size default=0
2025-06-04 23:38:15.923 | time=2025-06-05T03:38:15.923Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.patch_size default=0
2025-06-04 23:38:15.923 | time=2025-06-05T03:38:15.923Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.num_channels default=0
2025-06-04 23:38:15.923 | time=2025-06-05T03:38:15.923Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.block_count default=0
2025-06-04 23:38:15.923 | time=2025-06-05T03:38:15.923Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.embedding_length default=0
2025-06-04 23:38:15.923 | time=2025-06-05T03:38:15.923Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.attention.head_count default=0
2025-06-04 23:38:15.923 | time=2025-06-05T03:38:15.923Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.image_size default=0
2025-06-04 23:38:15.923 | time=2025-06-05T03:38:15.923Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.patch_size default=0
2025-06-04 23:38:15.923 | time=2025-06-05T03:38:15.923Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.attention.layer_norm_epsilon default=0
2025-06-04 23:38:15.927 | time=2025-06-05T03:38:15.927Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
2025-06-04 23:38:15.927 | time=2025-06-05T03:38:15.927Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
2025-06-04 23:38:15.927 | time=2025-06-05T03:38:15.927Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
2025-06-04 23:38:15.927 | time=2025-06-05T03:38:15.927Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
2025-06-04 23:38:15.928 | time=2025-06-05T03:38:15.928Z level=INFO source=server.go:410 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-4f1e32db877a9339df2d6529c1635570425cbe81f0aa3f7dd5d1452f2e632b42 --ctx-size 32768 --batch-size 512 --n-gpu-layers 62 --threads 8 --flash-attn --kv-cache-type q8_0 --parallel 1 --tensor-split 32,30 --port 45577"
2025-06-04 23:38:15.929 | time=2025-06-05T03:38:15.928Z level=INFO source=sched.go:452 msg="loaded runners" count=1
2025-06-04 23:38:15.929 | time=2025-06-05T03:38:15.928Z level=INFO source=server.go:589 msg="waiting for llama runner to start responding"
2025-06-04 23:38:15.929 | time=2025-06-05T03:38:15.929Z level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server not responding"
2025-06-04 23:38:15.942 | time=2025-06-05T03:38:15.942Z level=INFO source=runner.go:851 msg="starting ollama engine"
2025-06-04 23:38:15.953 | time=2025-06-05T03:38:15.953Z level=INFO source=runner.go:914 msg="Server listening on 127.0.0.1:45577"
2025-06-04 23:38:16.020 | time=2025-06-05T03:38:16.019Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
2025-06-04 23:38:16.021 | time=2025-06-05T03:38:16.020Z level=WARN source=ggml.go:152 msg="key not found" key=general.description default=""
2025-06-04 23:38:16.021 | time=2025-06-05T03:38:16.020Z level=INFO source=ggml.go:72 msg="" architecture=gemma3 file_type=Q4_0 name="Gemma 3 27b It Qat" description="" num_tensors=808 num_key_values=45
2025-06-04 23:38:16.025 | load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
2025-06-04 23:38:16.108 | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2025-06-04 23:38:16.108 | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
2025-06-04 23:38:16.108 | ggml_cuda_init: found 2 CUDA devices:
2025-06-04 23:38:16.108 | Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
2025-06-04 23:38:16.108 | Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
2025-06-04 23:38:16.180 | time=2025-06-05T03:38:16.180Z level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server loading model"
2025-06-04 23:38:16.201 | load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
2025-06-04 23:38:16.201 | time=2025-06-05T03:38:16.201Z level=INFO source=ggml.go:103 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
2025-06-04 23:38:16.396 | time=2025-06-05T03:38:16.396Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA0 size="7.0 GiB"
2025-06-04 23:38:16.396 | time=2025-06-05T03:38:16.396Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA1 size="6.5 GiB"
2025-06-04 23:38:16.396 | time=2025-06-05T03:38:16.396Z level=INFO source=ggml.go:298 msg="model weights" buffer=CPU size="2.2 GiB"
2025-06-04 23:38:54.231 | time=2025-06-05T03:38:54.231Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
2025-06-04 23:38:54.235 | time=2025-06-05T03:38:54.235Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.image_size default=0
2025-06-04 23:38:54.235 | time=2025-06-05T03:38:54.235Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.patch_size default=0
2025-06-04 23:38:54.235 | time=2025-06-05T03:38:54.235Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.num_channels default=0
2025-06-04 23:38:54.235 | time=2025-06-05T03:38:54.235Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.block_count default=0
2025-06-04 23:38:54.235 | time=2025-06-05T03:38:54.235Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.embedding_length default=0
2025-06-04 23:38:54.235 | time=2025-06-05T03:38:54.235Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.attention.head_count default=0
2025-06-04 23:38:54.235 | time=2025-06-05T03:38:54.235Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.image_size default=0
2025-06-04 23:38:54.235 | time=2025-06-05T03:38:54.235Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.patch_size default=0
2025-06-04 23:38:54.235 | time=2025-06-05T03:38:54.235Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.attention.layer_norm_epsilon default=0
2025-06-04 23:38:54.239 | time=2025-06-05T03:38:54.239Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
2025-06-04 23:38:54.239 | time=2025-06-05T03:38:54.239Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
2025-06-04 23:38:54.239 | time=2025-06-05T03:38:54.239Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
2025-06-04 23:38:54.239 | time=2025-06-05T03:38:54.239Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
2025-06-04 23:38:54.840 | time=2025-06-05T03:38:54.840Z level=INFO source=ggml.go:553 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="154.5 MiB"
2025-06-04 23:38:54.840 | time=2025-06-05T03:38:54.840Z level=INFO source=ggml.go:553 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="138.5 MiB"
2025-06-04 23:38:54.840 | time=2025-06-05T03:38:54.840Z level=INFO source=ggml.go:553 msg="compute graph" backend=CPU buffer_type=CPU size="10.5 MiB"
2025-06-04 23:38:55.077 | time=2025-06-05T03:38:55.077Z level=INFO source=server.go:628 msg="llama runner started in 39.15 seconds"
2025-06-04 23:38:58.031 | [GIN] 2025/06/05 - 03:38:58 | 200 | 49.997572682s | 172.18.0.5 | POST "/api/chat"
2025-06-04 23:39:17.494 | [GIN] 2025/06/05 - 03:39:17 | 200 | 22.19µs | 127.0.0.1 | HEAD "/"
2025-06-04 23:39:17.494 | [GIN] 2025/06/05 - 03:39:17 | 200 | 20.351µs | 127.0.0.1 | GET "/api/ps"

Additional Information

I suspect there is a mismatch in the parameters used for the title generation task, which causes the current model to be reloaded. I'm not sure it is related to the keep_alive parameter alone, because even when I run the Ollama container with the OLLAMA_KEEP_ALIVE=-1 variable, the model is still reloaded for title generation.
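The parameter-mismatch suspicion can be checked directly against the two "starting llama server" lines in the log above (03:36:45 and 03:38:15). A minimal sketch diffing the two runner command lines (abbreviated to their flags; the model blob path is elided since it is identical in both):

```python
# Runner flags copied from the two "starting llama server" log lines above.
first = ("--ollama-engine --ctx-size 32768 --batch-size 512 --n-gpu-layers 62 "
         "--threads 8 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 "
         "--tensor-split 32,30 --port 34837")
second = ("--ollama-engine --ctx-size 32768 --batch-size 512 --n-gpu-layers 62 "
          "--threads 8 --flash-attn --kv-cache-type q8_0 --parallel 1 "
          "--tensor-split 32,30 --port 45577")

# Tokens present in one command line but not the other.
diff = set(first.split()) ^ set(second.split())
print(sorted(diff))  # → ['--no-mmap', '34837', '45577']
```

Apart from the (expected) port change, the only difference is --no-mmap, which was passed on the initial chat load but not on the title-generation reload; this is consistent with the two requests carrying different options.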

msg="key not found" key=tokenizer.ggml.add_eot_token default=false 2025-06-04 23:36:45.102 | time=2025-06-05T03:36:45.102Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.image_size default=0 2025-06-04 23:36:45.102 | time=2025-06-05T03:36:45.102Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.patch_size default=0 2025-06-04 23:36:45.102 | time=2025-06-05T03:36:45.102Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.num_channels default=0 2025-06-04 23:36:45.102 | time=2025-06-05T03:36:45.102Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.block_count default=0 2025-06-04 23:36:45.102 | time=2025-06-05T03:36:45.102Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.embedding_length default=0 2025-06-04 23:36:45.102 | time=2025-06-05T03:36:45.102Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.attention.head_count default=0 2025-06-04 23:36:45.102 | time=2025-06-05T03:36:45.102Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.image_size default=0 2025-06-04 23:36:45.102 | time=2025-06-05T03:36:45.102Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.patch_size default=0 2025-06-04 23:36:45.102 | time=2025-06-05T03:36:45.102Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.attention.layer_norm_epsilon default=0 2025-06-04 23:36:45.106 | time=2025-06-05T03:36:45.106Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000 2025-06-04 23:36:45.106 | time=2025-06-05T03:36:45.106Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 2025-06-04 23:36:45.106 | time=2025-06-05T03:36:45.106Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1 2025-06-04 23:36:45.106 | time=2025-06-05T03:36:45.106Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image 
default=256 2025-06-04 23:36:45.108 | time=2025-06-05T03:36:45.108Z level=INFO source=server.go:410 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-4f1e32db877a9339df2d6529c1635570425cbe81f0aa3f7dd5d1452f2e632b42 --ctx-size 32768 --batch-size 512 --n-gpu-layers 62 --threads 8 --flash-attn --kv-cache-type q8_0 --no-mmap --parallel 1 --tensor-split 32,30 --port 34837" 2025-06-04 23:36:45.109 | time=2025-06-05T03:36:45.108Z level=INFO source=sched.go:452 msg="loaded runners" count=1 2025-06-04 23:36:45.109 | time=2025-06-05T03:36:45.109Z level=INFO source=server.go:589 msg="waiting for llama runner to start responding" 2025-06-04 23:36:45.110 | time=2025-06-05T03:36:45.110Z level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server not responding" 2025-06-04 23:36:45.120 | time=2025-06-05T03:36:45.120Z level=INFO source=runner.go:851 msg="starting ollama engine" 2025-06-04 23:36:45.131 | time=2025-06-05T03:36:45.131Z level=INFO source=runner.go:914 msg="Server listening on 127.0.0.1:34837" 2025-06-04 23:36:45.201 | time=2025-06-05T03:36:45.201Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32 2025-06-04 23:36:45.202 | time=2025-06-05T03:36:45.202Z level=WARN source=ggml.go:152 msg="key not found" key=general.description default="" 2025-06-04 23:36:45.202 | time=2025-06-05T03:36:45.202Z level=INFO source=ggml.go:72 msg="" architecture=gemma3 file_type=Q4_0 name="Gemma 3 27b It Qat" description="" num_tensors=808 num_key_values=45 2025-06-04 23:36:45.224 | load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so 2025-06-04 23:36:45.362 | time=2025-06-05T03:36:45.362Z level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server loading model" 2025-06-04 23:36:45.573 | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no 2025-06-04 23:36:45.573 | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 
2025-06-04 23:36:45.573 | ggml_cuda_init: found 2 CUDA devices: 2025-06-04 23:36:45.573 | Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes 2025-06-04 23:36:45.573 | Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes 2025-06-04 23:36:45.689 | load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so 2025-06-04 23:36:45.689 | time=2025-06-05T03:36:45.689Z level=INFO source=ggml.go:103 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) 2025-06-04 23:36:45.896 | time=2025-06-05T03:36:45.896Z level=INFO source=ggml.go:298 msg="model weights" buffer=CPU size="2.2 GiB" 2025-06-04 23:36:45.896 | time=2025-06-05T03:36:45.896Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA0 size="7.0 GiB" 2025-06-04 23:36:45.896 | time=2025-06-05T03:36:45.896Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA1 size="6.5 GiB" 2025-06-04 23:37:02.601 | [GIN] 2025/06/05 - 03:37:02 | 200 | 1.585813342s | 172.18.0.5 | GET "/api/tags" 2025-06-04 23:37:02.603 | [GIN] 2025/06/05 - 03:37:02 | 200 | 47.182µs | 172.18.0.5 | GET "/api/ps" 2025-06-04 23:37:04.581 | [GIN] 2025/06/05 - 03:37:04 | 200 | 1.475773577s | 172.18.0.5 | GET "/api/tags" 2025-06-04 23:37:04.583 | [GIN] 2025/06/05 - 03:37:04 | 200 | 41.021µs | 172.18.0.5 | GET "/api/ps" 2025-06-04 23:37:23.093 | time=2025-06-05T03:37:23.092Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false 2025-06-04 23:37:23.097 | time=2025-06-05T03:37:23.096Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.image_size default=0 2025-06-04 23:37:23.097 | time=2025-06-05T03:37:23.096Z level=WARN 
source=ggml.go:152 msg="key not found" key=gemma3.vision.patch_size default=0 2025-06-04 23:37:23.097 | time=2025-06-05T03:37:23.096Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.num_channels default=0 2025-06-04 23:37:23.097 | time=2025-06-05T03:37:23.096Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.block_count default=0 2025-06-04 23:37:23.097 | time=2025-06-05T03:37:23.096Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.embedding_length default=0 2025-06-04 23:37:23.097 | time=2025-06-05T03:37:23.096Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.attention.head_count default=0 2025-06-04 23:37:23.097 | time=2025-06-05T03:37:23.096Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.image_size default=0 2025-06-04 23:37:23.097 | time=2025-06-05T03:37:23.096Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.patch_size default=0 2025-06-04 23:37:23.097 | time=2025-06-05T03:37:23.096Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.attention.layer_norm_epsilon default=0 2025-06-04 23:37:23.101 | time=2025-06-05T03:37:23.100Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000 2025-06-04 23:37:23.101 | time=2025-06-05T03:37:23.100Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 2025-06-04 23:37:23.101 | time=2025-06-05T03:37:23.100Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1 2025-06-04 23:37:23.101 | time=2025-06-05T03:37:23.100Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256 2025-06-04 23:37:23.558 | time=2025-06-05T03:37:23.558Z level=INFO source=ggml.go:553 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="154.5 MiB" 2025-06-04 23:37:23.558 | time=2025-06-05T03:37:23.558Z level=INFO source=ggml.go:553 msg="compute graph" 
backend=CUDA1 buffer_type=CUDA1 size="138.5 MiB" 2025-06-04 23:37:23.558 | time=2025-06-05T03:37:23.558Z level=INFO source=ggml.go:553 msg="compute graph" backend=CPU buffer_type=CPU size="10.5 MiB" 2025-06-04 23:37:23.761 | time=2025-06-05T03:37:23.761Z level=INFO source=server.go:628 msg="llama runner started in 38.65 seconds" 2025-06-04 23:38:08.008 | [GIN] 2025/06/05 - 03:38:08 | 200 | 1m25s | 172.18.0.5 | POST "/api/chat" 2025-06-04 23:38:08.220 | time=2025-06-05T03:38:08.219Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32 2025-06-04 23:38:13.258 | time=2025-06-05T03:38:13.258Z level=WARN source=sched.go:655 msg="gpu VRAM usage didn't recover within timeout" seconds=5.036824613 model=/root/.ollama/models/blobs/sha256-4f1e32db877a9339df2d6529c1635570425cbe81f0aa3f7dd5d1452f2e632b42 2025-06-04 23:38:13.534 | time=2025-06-05T03:38:13.534Z level=WARN source=sched.go:655 msg="gpu VRAM usage didn't recover within timeout" seconds=5.313355817 model=/root/.ollama/models/blobs/sha256-4f1e32db877a9339df2d6529c1635570425cbe81f0aa3f7dd5d1452f2e632b42 2025-06-04 23:38:13.761 | time=2025-06-05T03:38:13.761Z level=WARN source=sched.go:655 msg="gpu VRAM usage didn't recover within timeout" seconds=5.540399162 model=/root/.ollama/models/blobs/sha256-4f1e32db877a9339df2d6529c1635570425cbe81f0aa3f7dd5d1452f2e632b42 2025-06-04 23:38:14.019 | time=2025-06-05T03:38:14.019Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32 2025-06-04 23:38:14.076 | time=2025-06-05T03:38:14.076Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32 2025-06-04 23:38:14.079 | time=2025-06-05T03:38:14.079Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32 2025-06-04 23:38:14.301 | time=2025-06-05T03:38:14.301Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32 2025-06-04 23:38:14.523 | time=2025-06-05T03:38:14.523Z level=WARN 
source=ggml.go:152 msg="key not found" key=general.alignment default=32 2025-06-04 23:38:14.763 | time=2025-06-05T03:38:14.763Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32 2025-06-04 23:38:14.979 | time=2025-06-05T03:38:14.979Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32 2025-06-04 23:38:15.202 | time=2025-06-05T03:38:15.202Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32 2025-06-04 23:38:15.635 | time=2025-06-05T03:38:15.635Z level=INFO source=server.go:106 msg="system memory" total="47.0 GiB" free="39.7 GiB" free_swap="12.0 GiB" 2025-06-04 23:38:15.637 | time=2025-06-05T03:38:15.637Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32 2025-06-04 23:38:15.853 | time=2025-06-05T03:38:15.853Z level=INFO source=server.go:139 msg=offload library=cuda layers.requested=-1 layers.model=63 layers.offload=62 layers.split=32,30 memory.available="[11.0 GiB 11.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="22.7 GiB" memory.required.partial="21.6 GiB" memory.required.kv="1.6 GiB" memory.required.allocations="[10.8 GiB 10.8 GiB]" memory.weights.total="14.5 GiB" memory.weights.repeating="13.5 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="2.2 GiB" memory.graph.partial="2.2 GiB" projector.weights="818.0 MiB" projector.graph="0 B" 2025-06-04 23:38:15.853 | time=2025-06-05T03:38:15.853Z level=INFO source=server.go:186 msg="enabling flash attention" 2025-06-04 23:38:15.918 | time=2025-06-05T03:38:15.918Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32 2025-06-04 23:38:15.919 | time=2025-06-05T03:38:15.919Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false 2025-06-04 23:38:15.923 | time=2025-06-05T03:38:15.923Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.image_size default=0 2025-06-04 23:38:15.923 | 
time=2025-06-05T03:38:15.923Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.patch_size default=0 2025-06-04 23:38:15.923 | time=2025-06-05T03:38:15.923Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.num_channels default=0 2025-06-04 23:38:15.923 | time=2025-06-05T03:38:15.923Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.block_count default=0 2025-06-04 23:38:15.923 | time=2025-06-05T03:38:15.923Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.embedding_length default=0 2025-06-04 23:38:15.923 | time=2025-06-05T03:38:15.923Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.attention.head_count default=0 2025-06-04 23:38:15.923 | time=2025-06-05T03:38:15.923Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.image_size default=0 2025-06-04 23:38:15.923 | time=2025-06-05T03:38:15.923Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.patch_size default=0 2025-06-04 23:38:15.923 | time=2025-06-05T03:38:15.923Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.attention.layer_norm_epsilon default=0 2025-06-04 23:38:15.927 | time=2025-06-05T03:38:15.927Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000 2025-06-04 23:38:15.927 | time=2025-06-05T03:38:15.927Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 2025-06-04 23:38:15.927 | time=2025-06-05T03:38:15.927Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1 2025-06-04 23:38:15.927 | time=2025-06-05T03:38:15.927Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256 2025-06-04 23:38:15.928 | time=2025-06-05T03:38:15.928Z level=INFO source=server.go:410 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model 
/root/.ollama/models/blobs/sha256-4f1e32db877a9339df2d6529c1635570425cbe81f0aa3f7dd5d1452f2e632b42 --ctx-size 32768 --batch-size 512 --n-gpu-layers 62 --threads 8 --flash-attn --kv-cache-type q8_0 --parallel 1 --tensor-split 32,30 --port 45577" 2025-06-04 23:38:15.929 | time=2025-06-05T03:38:15.928Z level=INFO source=sched.go:452 msg="loaded runners" count=1 2025-06-04 23:38:15.929 | time=2025-06-05T03:38:15.928Z level=INFO source=server.go:589 msg="waiting for llama runner to start responding" 2025-06-04 23:38:15.929 | time=2025-06-05T03:38:15.929Z level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server not responding" 2025-06-04 23:38:15.942 | time=2025-06-05T03:38:15.942Z level=INFO source=runner.go:851 msg="starting ollama engine" 2025-06-04 23:38:15.953 | time=2025-06-05T03:38:15.953Z level=INFO source=runner.go:914 msg="Server listening on 127.0.0.1:45577" 2025-06-04 23:38:16.020 | time=2025-06-05T03:38:16.019Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32 2025-06-04 23:38:16.021 | time=2025-06-05T03:38:16.020Z level=WARN source=ggml.go:152 msg="key not found" key=general.description default="" 2025-06-04 23:38:16.021 | time=2025-06-05T03:38:16.020Z level=INFO source=ggml.go:72 msg="" architecture=gemma3 file_type=Q4_0 name="Gemma 3 27b It Qat" description="" num_tensors=808 num_key_values=45 2025-06-04 23:38:16.025 | load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so 2025-06-04 23:38:16.108 | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no 2025-06-04 23:38:16.108 | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 2025-06-04 23:38:16.108 | ggml_cuda_init: found 2 CUDA devices: 2025-06-04 23:38:16.108 | Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes 2025-06-04 23:38:16.108 | Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes 2025-06-04 23:38:16.180 | time=2025-06-05T03:38:16.180Z level=INFO source=server.go:623 msg="waiting for server 
to become available" status="llm server loading model" 2025-06-04 23:38:16.201 | load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so 2025-06-04 23:38:16.201 | time=2025-06-05T03:38:16.201Z level=INFO source=ggml.go:103 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) 2025-06-04 23:38:16.396 | time=2025-06-05T03:38:16.396Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA0 size="7.0 GiB" 2025-06-04 23:38:16.396 | time=2025-06-05T03:38:16.396Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA1 size="6.5 GiB" 2025-06-04 23:38:16.396 | time=2025-06-05T03:38:16.396Z level=INFO source=ggml.go:298 msg="model weights" buffer=CPU size="2.2 GiB" 2025-06-04 23:38:54.231 | time=2025-06-05T03:38:54.231Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false 2025-06-04 23:38:54.235 | time=2025-06-05T03:38:54.235Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.image_size default=0 2025-06-04 23:38:54.235 | time=2025-06-05T03:38:54.235Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.patch_size default=0 2025-06-04 23:38:54.235 | time=2025-06-05T03:38:54.235Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.num_channels default=0 2025-06-04 23:38:54.235 | time=2025-06-05T03:38:54.235Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.block_count default=0 2025-06-04 23:38:54.235 | time=2025-06-05T03:38:54.235Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.embedding_length default=0 2025-06-04 23:38:54.235 | time=2025-06-05T03:38:54.235Z level=WARN source=ggml.go:152 
msg="key not found" key=gemma3.vision.attention.head_count default=0 2025-06-04 23:38:54.235 | time=2025-06-05T03:38:54.235Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.image_size default=0 2025-06-04 23:38:54.235 | time=2025-06-05T03:38:54.235Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.patch_size default=0 2025-06-04 23:38:54.235 | time=2025-06-05T03:38:54.235Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.vision.attention.layer_norm_epsilon default=0 2025-06-04 23:38:54.239 | time=2025-06-05T03:38:54.239Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000 2025-06-04 23:38:54.239 | time=2025-06-05T03:38:54.239Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06 2025-06-04 23:38:54.239 | time=2025-06-05T03:38:54.239Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1 2025-06-04 23:38:54.239 | time=2025-06-05T03:38:54.239Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256 2025-06-04 23:38:54.840 | time=2025-06-05T03:38:54.840Z level=INFO source=ggml.go:553 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="154.5 MiB" 2025-06-04 23:38:54.840 | time=2025-06-05T03:38:54.840Z level=INFO source=ggml.go:553 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="138.5 MiB" 2025-06-04 23:38:54.840 | time=2025-06-05T03:38:54.840Z level=INFO source=ggml.go:553 msg="compute graph" backend=CPU buffer_type=CPU size="10.5 MiB" 2025-06-04 23:38:55.077 | time=2025-06-05T03:38:55.077Z level=INFO source=server.go:628 msg="llama runner started in 39.15 seconds" 2025-06-04 23:38:58.031 | [GIN] 2025/06/05 - 03:38:58 | 200 | 49.997572682s | 172.18.0.5 | POST "/api/chat" 2025-06-04 23:39:17.494 | [GIN] 2025/06/05 - 03:39:17 | 200 | 22.19µs | 127.0.0.1 | HEAD "/" 2025-06-04 23:39:17.494 | [GIN] 2025/06/05 - 03:39:17 | 200 | 20.351µs | 127.0.0.1 | 
GET "/api/ps"

### Additional Information

I suspect there may be a mismatch in the parameters used for the title generation task, leading to the current model being reloaded. I'm not sure whether it's related to the keep_alive parameter, because even when I run the Ollama container with the OLLAMA_KEEP_ALIVE=-1 variable, the model is still reloaded for title generation.
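One way to observe the mismatch from the host, independent of the reproduction steps above, is to watch the loaded model's expiry time before and after title generation. This is an illustrative sketch, not from the original report; it assumes Ollama is reachable at localhost:11434 and that `jq` is installed. The `expires_at` field comes from Ollama's documented `/api/ps` response:

```shell
# While the main response is streaming, the model should show a far-future
# expires_at (keep_alive -1m from Open WebUI means "never unload").
# If title generation reloads the model with Ollama's 5m default,
# expires_at drops to roughly now + 5 minutes.
curl -s http://localhost:11434/api/ps | jq '.models[] | {name, expires_at}'
```

Running this once during streaming and again after the title appears should show the expiry jumping from the far future back to a 5-minute window, matching what the logs above record.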
GiteaMirror added the bug label 2026-04-19 23:04:11 -05:00

@rgaricano commented on GitHub (Jun 6, 2025):

If Ollama's keep_alive is set differently from the one sent in the request, Ollama reloads the model. You can try setting Ollama's keep_alive to -1 and keeping Open WebUI's at the default (or at -1 too); in that case Ollama only reloads when the model (or other parameters) changes. It also depends on how many models Ollama is configured to keep in memory.
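As a rough sketch of this suggestion (the container and volume names below are illustrative, not taken from the reporter's setup), keep_alive can be pinned on the Ollama side at container start:

```shell
# Keep loaded models in memory indefinitely (keep_alive = -1 means "never unload").
# OLLAMA_KEEP_ALIVE and OLLAMA_MAX_LOADED_MODELS are documented Ollama env vars.
docker run -d \
  -e OLLAMA_KEEP_ALIVE=-1 \
  -e OLLAMA_MAX_LOADED_MODELS=2 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama
```

Then leave the keep_alive setting in Open WebUI at its default (or set it to -1 as well) so the value sent with each request matches what Ollama already uses.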

<!-- gh-comment-id:2948610789 -->

@trinhkvo commented on GitHub (Jun 13, 2025):

There seems to be something wrong with the parameters included in the queries for tasks. I upgraded to v0.6.14 and still see that the model is reloaded for every new prompt in the same conversation.
I tested Ollama from the terminal and this reloading issue does not occur, so it's not an issue with Ollama.
Then I disabled Tag generation, Title generation, and Follow-up generation in the Interface settings, and the issue was gone: I could continue to chat in the same conversation without the model being reloaded.

<!-- gh-comment-id:2971501769 -->

@rgaricano commented on GitHub (Jun 13, 2025):

How do you have the Local Task Model set? To Current Model?

<!-- gh-comment-id:2971652558 -->

@trinhkvo commented on GitHub (Jun 14, 2025):

Yes, Current model.

Task model selection results:

- Current model; Title, Tag, Follow-up generation all on: the model is reloaded for each new message in the conversation.
- Current model; Title, Tag, Follow-up generation all off: no model reloading.
- A different tiny model for tasks; Title, Tag, Follow-up generation all on: no model reloading. For some reason, follow-ups cannot be generated, but that's for another issue report.
<!-- gh-comment-id:2972107733 -->

@rgaricano commented on GitHub (Jun 14, 2025):

You are using different context lengths.

Ollama's default is set to `OLLAMA_CONTEXT_LENGTH:4096`,

while the request asks for `--ctx-size 32768`.

You can try setting Ollama's default context length to 32768, or the model's context length to 4096; in any case, make them the same (or just set one in Ollama and leave the defaults in Open WebUI).
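That alignment can be sketched as a container configuration; the 32768 value comes from the logs in this issue, while the flags around it are illustrative:

```shell
# Align Ollama's default context length with the 32768 tokens the request asks for,
# so the scheduler does not restart the runner with a different --ctx-size.
# (OLLAMA_CONTEXT_LENGTH and OLLAMA_KEEP_ALIVE are documented Ollama env vars;
# the volume and container names below are placeholders.)
docker run -d \
  -e OLLAMA_CONTEXT_LENGTH=32768 \
  -e OLLAMA_KEEP_ALIVE=-1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama
```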

<!-- gh-comment-id:2972481450 -->
Reference: github-starred/open-webui#17332