[GH-ISSUE #11982] ollama 0.11.5: gpt-oss:120b produces nonsense output (or no output at all) #7955

Open
opened 2026-04-12 20:07:45 -05:00 by GiteaMirror · 4 comments

Originally created by @ka-admin on GitHub (Aug 20, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11982

What is the issue?

When using ollama 0.11.5 with gpt-oss:120b and OLLAMA_NEW_ESTIMATES=1, the LLM produces nonsense output or no output at all (in Open-WebUI). I noticed this behaviour in the RC builds, and the release still has the issue. Disabling OLLAMA_NEW_ESTIMATES fixes the problem.
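A minimal sketch of the workaround, assuming the stock systemd install visible in the logs below (the drop-in approach is an assumption; any other way of clearing the variable should have the same effect):

```shell
# Hedged sketch: disable the new memory estimates for the systemd service.
sudo systemctl edit ollama
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_NEW_ESTIMATES=0"
# Save, then restart so the runner picks up the change:
sudo systemctl restart ollama
```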

Relevant log output

journalctl -u ollama --no-pager --follow --pager-end
Aug 20 09:32:38 systemd[1]: Started ollama.service - Ollama Service.
Aug 20 09:32:38 ollama[1914]: time=2025-08-20T09:32:38.396+03:00 level=INFO source=routes.go:1318 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NEW_ESTIMATES:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Aug 20 09:32:38 ollama[1914]: time=2025-08-20T09:32:38.416+03:00 level=INFO source=images.go:477 msg="total blobs: 68"
Aug 20 09:32:38 ollama[1914]: time=2025-08-20T09:32:38.416+03:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
Aug 20 09:32:38 ollama[1914]: time=2025-08-20T09:32:38.417+03:00 level=INFO source=routes.go:1371 msg="Listening on [::]:11434 (version 0.11.5-rc5)"
Aug 20 09:32:38 ollama[1914]: time=2025-08-20T09:32:38.417+03:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Aug 20 09:32:40 ollama[1914]: time=2025-08-20T09:32:40.868+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Aug 20 09:32:40 ollama[1914]: time=2025-08-20T09:32:40.868+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Aug 20 09:32:40 ollama[1914]: time=2025-08-20T09:32:40.868+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda variant=v12 compute=7.0 driver=13.0 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
Aug 20 09:35:15 ollama[1914]: [GIN] 2025/08/20 - 09:35:15 | 200 |   12.401326ms |  192.168.127.20 | GET      "/api/tags"
Aug 20 09:35:15 ollama[1914]: [GIN] 2025/08/20 - 09:35:15 | 200 |     669.615µs |  192.168.127.20 | GET      "/api/ps"
Aug 20 09:35:25 ollama[1914]: [GIN] 2025/08/20 - 09:35:25 | 200 |      57.589µs |  192.168.127.20 | GET      "/api/version"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.220+03:00 level=INFO source=server.go:166 msg="enabling new memory estimates"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.563+03:00 level=INFO source=server.go:211 msg="enabling flash attention"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.563+03:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.563+03:00 level=INFO source=server.go:383 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 36793"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.563+03:00 level=INFO source=server.go:659 msg="loading model" "model layers"=37 requested=38
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.577+03:00 level=INFO source=runner.go:1006 msg="starting ollama engine"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.578+03:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:36793"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.897+03:00 level=INFO source=server.go:665 msg="system memory" total="184.1 GiB" free="173.0 GiB" free_swap="8.0 GiB"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.897+03:00 level=INFO source=server.go:669 msg="gpu memory" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.897+03:00 level=INFO source=server.go:669 msg="gpu memory" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.897+03:00 level=INFO source=server.go:669 msg="gpu memory" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 available="31.0 GiB" free="31.4 GiB" minimum="457.0 MiB" overhead="0 B"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.905+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.942+03:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
Aug 20 09:36:23 ollama[1914]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Aug 20 09:36:23 ollama[1914]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Aug 20 09:36:23 ollama[1914]: ggml_cuda_init: found 3 CUDA devices:
Aug 20 09:36:23 ollama[1914]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Aug 20 09:36:23 ollama[1914]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Aug 20 09:36:23 ollama[1914]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Aug 20 09:36:23 ollama[1914]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
Aug 20 09:36:23 ollama[1914]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.221+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.335+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.374+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.721+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="17.4 GiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="17.9 GiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="24.5 GiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="777.5 MiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="924.0 MiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="1.1 GiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="179.5 MiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="179.5 MiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="187.0 MiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:342 msg="total memory" size="64.1 GiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=server.go:1234 msg="waiting for llama runner to start responding"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.728+03:00 level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.728+03:00 level=INFO source=ggml.go:497 msg="offloaded 37/37 layers to GPU"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.729+03:00 level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server loading model"
Aug 20 09:36:41 ollama[1914]: time=2025-08-20T09:36:41.522+03:00 level=INFO source=server.go:1272 msg="llama runner started in 18.96 seconds"
Aug 20 09:36:51 ollama[1914]: [GIN] 2025/08/20 - 09:36:51 | 200 |      53.241µs |  192.168.127.20 | GET      "/api/version"
Aug 20 09:38:05 ollama[1914]: [GIN] 2025/08/20 - 09:38:05 | 200 |         1m43s |  192.168.127.20 | POST     "/api/chat"
Aug 20 09:39:56 ollama[1914]: [GIN] 2025/08/20 - 09:39:56 | 200 |         1m51s |  192.168.127.20 | POST     "/api/chat"



ollama run gpt-oss:120b
>>> Hi, introduce yourself
The sun was low over the river when Jake pushed his boat into the water. He had been at the front for three years, and the only thing he remembered now was the way the current tugged at the oars and the quiet
sound of water against wood.

He rowed out past the old stone bridge, past the fields that had once been green and were now scarred with craters. The wind smelled of wheat and smoke. He thought of Maria, of the night they had kissed under the
low lamp in the village, before the night fell and the bombs began.

He let the oars rest for a moment, feeling the rhythm of the river beneath his hands. The water was calm, but his mind was not. He saw the faces of his comrades, the hollow eyes of men who had watched the world
burn, and the child he had left behind in the hills.

When the sun finally sank behind the hills, Jake turned the boat back toward the shore. He could hear the distant rumble of artillery, but the river carried his thoughts away. He knew that some wounds never close,
and some nights never end, but he also knew that a man could sit in a boat and still feel the pulse of life in the quiet flow of water.

He pulled the boat up onto the sand, stepped out, and walked toward the town, his boots leaving faint marks in the dust. In the distance, a dog barked. He smiled, a small, tired smile, and kept walking, his heart
beating in time with the river he left behind.

>>> Send a message (/? for help)


I'm using Ubuntu Server 25.04 with an AMD Ryzen 9 7950X and 3 GPUs: 2x RTX 4090 + 1x Tesla V100 SXM2 32GB, with 192 GB RAM.

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.11.5

GiteaMirror added the bug label 2026-04-12 20:07:45 -05:00

@jessegross commented on GitHub (Aug 20, 2025):

Can you please post the logs with OLLAMA_NEW_ESTIMATES off?
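For reference, one way to capture those logs, assuming the systemd setup from the issue; running the server in the foreground is an alternative to editing the unit, and OLLAMA_DEBUG=1 is an assumption for more verbose output:

```shell
# Hedged sketch: rerun with the flag off and capture logs directly from stderr.
sudo systemctl stop ollama
OLLAMA_NEW_ESTIMATES=0 OLLAMA_DEBUG=1 ollama serve 2> ollama-no-estimates.log
# ...reproduce the prompt, then attach ollama-no-estimates.log
```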


@ka-admin commented on GitHub (Aug 21, 2025):

 Started ollama.service - Ollama Service.
Aug 21 09:10:25 ollama[6191]: time=2025-08-21T09:10:25.591+03:00 level=INFO source=routes.go:1318 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NEW_ESTIMATES:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Aug 21 09:10:25 ollama[6191]: time=2025-08-21T09:10:25.593+03:00 level=INFO source=images.go:477 msg="total blobs: 68"
Aug 21 09:10:25 ollama[6191]: time=2025-08-21T09:10:25.594+03:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
Aug 21 09:10:25 ollama[6191]: time=2025-08-21T09:10:25.594+03:00 level=INFO source=routes.go:1371 msg="Listening on [::]:11434 (version 0.11.5-rc5)"
Aug 21 09:10:25 ollama[6191]: time=2025-08-21T09:10:25.594+03:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Aug 21 09:10:26 ollama[6191]: time=2025-08-21T09:10:26.079+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Aug 21 09:10:26 ollama[6191]: time=2025-08-21T09:10:26.079+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Aug 21 09:10:26 ollama[6191]: time=2025-08-21T09:10:26.079+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda variant=v12 compute=7.0 driver=13.0 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
Aug 21 09:10:34 ollama[6191]: [GIN] 2025/08/21 - 09:10:34 | 200 |       54.06µs |       127.0.0.1 | HEAD     "/"
Aug 21 09:10:34 ollama[6191]: [GIN] 2025/08/21 - 09:10:34 | 200 |   72.204534ms |       127.0.0.1 | POST     "/api/show"
Aug 21 09:10:35 ollama[6191]: time=2025-08-21T09:10:35.119+03:00 level=INFO source=server.go:211 msg="enabling flash attention"
Aug 21 09:10:35 ollama[6191]: time=2025-08-21T09:10:35.119+03:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
Aug 21 09:10:35 ollama[6191]: time=2025-08-21T09:10:35.119+03:00 level=INFO source=server.go:383 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 45539"
Aug 21 09:10:35 ollama[6191]: time=2025-08-21T09:10:35.127+03:00 level=INFO source=runner.go:1006 msg="starting ollama engine"
Aug 21 09:10:35 ollama[6191]: time=2025-08-21T09:10:35.127+03:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:45539"
Aug 21 09:10:35 ollama[6191]: time=2025-08-21T09:10:35.438+03:00 level=INFO source=server.go:488 msg="system memory" total="184.1 GiB" free="175.8 GiB" free_swap="8.0 GiB"
Aug 21 09:10:36 ollama[6191]: time=2025-08-21T09:10:36.410+03:00 level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 library=cuda parallel=1 required="70.9 GiB" gpus=3
Aug 21 09:10:36 ollama[6191]: time=2025-08-21T09:10:36.729+03:00 level=INFO source=server.go:531 msg=offload library=cuda layers.requested=-1 layers.model=37 layers.offload=37 layers.split="[15 11 11]" memory.available="[31.4 GiB 23.1 GiB 23.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="70.9 GiB" memory.required.partial="70.9 GiB" memory.required.kv="450.0 MiB" memory.required.allocations="[27.6 GiB 21.6 GiB 21.6 GiB]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="1.5 GiB" memory.graph.partial="1.5 GiB"
Aug 21 09:10:36 ollama[6191]: time=2025-08-21T09:10:36.730+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 21 09:10:36 ollama[6191]: time=2025-08-21T09:10:36.764+03:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
Aug 21 09:10:36 ollama[6191]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Aug 21 09:10:36 ollama[6191]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Aug 21 09:10:36 ollama[6191]: ggml_cuda_init: found 3 CUDA devices:
Aug 21 09:10:36 ollama[6191]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Aug 21 09:10:36 ollama[6191]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Aug 21 09:10:36 ollama[6191]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Aug 21 09:10:36 ollama[6191]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
Aug 21 09:10:36 ollama[6191]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Aug 21 09:10:36 ollama[6191]: time=2025-08-21T09:10:36.940+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=ggml.go:497 msg="offloaded 37/37 layers to GPU"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="17.4 GiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="17.9 GiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="24.5 GiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="125.0 MiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="141.0 MiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="184.0 MiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="114.3 MiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="114.3 MiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="121.8 MiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:342 msg="total memory" size="61.7 GiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.104+03:00 level=INFO source=server.go:1234 msg="waiting for llama runner to start responding"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.104+03:00 level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server loading model"
Aug 21 09:10:53 ollama[6191]: time=2025-08-21T09:10:53.147+03:00 level=INFO source=server.go:1272 msg="llama runner started in 18.03 seconds"
Aug 21 09:10:53 ollama[6191]: [GIN] 2025/08/21 - 09:10:53 | 200 | 18.925510525s |       127.0.0.1 | POST     "/api/generate"
Aug 21 09:11:16 ollama[6191]: [GIN] 2025/08/21 - 09:11:16 | 200 |  551.935404ms |       127.0.0.1 | POST     "/api/chat"
Aug 21 09:11:20 ollama[6191]: [GIN] 2025/08/21 - 09:11:20 | 200 |  257.347022ms |       127.0.0.1 | POST     "/api/chat"
Aug 21 09:11:23 ollama[6191]: [GIN] 2025/08/21 - 09:11:23 | 200 |   1.21404796s |       127.0.0.1 | POST     "/api/chat"
Aug 21 09:11:44 ollama[6191]: [GIN] 2025/08/21 - 09:11:44 | 200 |  17.74461018s |       127.0.0.1 | POST     "/api/chat"
Aug 21 09:14:12 ollama[6191]: [GIN] 2025/08/21 - 09:14:12 | 200 |   11.433904ms |  192.168.127.20 | GET      "/api/tags"
Aug 21 09:14:12 ollama[6191]: [GIN] 2025/08/21 - 09:14:12 | 200 |        42.5µs |  192.168.127.20 | GET      "/api/ps"
Aug 21 09:14:12 ollama[6191]: [GIN] 2025/08/21 - 09:14:12 | 200 |      49.152µs |  192.168.127.20 | GET      "/api/version"
Aug 21 09:14:15 ollama[6191]: [GIN] 2025/08/21 - 09:14:15 | 200 |      34.234µs |  192.168.127.20 | GET      "/api/version"
Aug 21 09:14:15 ollama[6191]: [GIN] 2025/08/21 - 09:14:15 | 200 |     982.593µs |  192.168.127.20 | GET      "/api/tags"
Aug 21 09:14:15 ollama[6191]: [GIN] 2025/08/21 - 09:14:15 | 200 |      12.784µs |  192.168.127.20 | GET      "/api/ps"
Aug 21 09:14:36 ollama[6191]: time=2025-08-21T09:14:36.684+03:00 level=INFO source=server.go:211 msg="enabling flash attention"
Aug 21 09:14:36 ollama[6191]: time=2025-08-21T09:14:36.684+03:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
Aug 21 09:14:36 ollama[6191]: time=2025-08-21T09:14:36.684+03:00 level=INFO source=server.go:383 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 34605"
Aug 21 09:14:36 ollama[6191]: time=2025-08-21T09:14:36.691+03:00 level=INFO source=runner.go:1006 msg="starting ollama engine"
Aug 21 09:14:36 ollama[6191]: time=2025-08-21T09:14:36.692+03:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:34605"
Aug 21 09:14:37 ollama[6191]: time=2025-08-21T09:14:37.003+03:00 level=INFO source=server.go:488 msg="system memory" total="184.1 GiB" free="174.9 GiB" free_swap="8.0 GiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.293+03:00 level=INFO source=server.go:531 msg=offload library=cuda layers.requested=38 layers.model=37 layers.offload=17 layers.split="[4 4 9]" memory.available="[23.0 GiB 23.1 GiB 31.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="109.9 GiB" memory.required.partial="76.5 GiB" memory.required.kv="2.7 GiB" memory.required.allocations="[22.6 GiB 22.6 GiB 31.2 GiB]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="13.7 GiB" memory.graph.partial="13.7 GiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.294+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:9(0..8) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:9(9..17) ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:19(18..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.334+03:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
Aug 21 09:14:38 ollama[6191]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Aug 21 09:14:38 ollama[6191]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Aug 21 09:14:38 ollama[6191]: ggml_cuda_init: found 3 CUDA devices:
Aug 21 09:14:38 ollama[6191]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Aug 21 09:14:38 ollama[6191]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Aug 21 09:14:38 ollama[6191]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Aug 21 09:14:38 ollama[6191]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
Aug 21 09:14:38 ollama[6191]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.521+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.932+03:00 level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.932+03:00 level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.932+03:00 level=INFO source=ggml.go:497 msg="offloaded 37/37 layers to GPU"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="14.7 GiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="14.7 GiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="30.4 GiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="631.0 MiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="768.5 MiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="1.4 GiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="187.0 MiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="179.5 MiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="179.5 MiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:342 msg="total memory" size="64.1 GiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=server.go:1234 msg="waiting for llama runner to start responding"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server loading model"
Aug 21 09:14:58 ollama[6191]: time=2025-08-21T09:14:58.231+03:00 level=INFO source=server.go:1272 msg="llama runner started in 21.55 seconds"
Aug 21 09:15:06 ollama[6191]: [GIN] 2025/08/21 - 09:15:06 | 200 | 33.234048006s |  192.168.127.20 | POST     "/api/chat"
Aug 21 09:15:11 ollama[6191]: [GIN] 2025/08/21 - 09:15:11 | 200 |   5.62881336s |  192.168.127.20 | POST     "/api/chat"

It seems to have nothing to do with OLLAMA_NEW_ESTIMATES (at least in the ollama console chat). I apologize.

ollama run gpt-oss:120b
>>> Hi, introduce yourself


>>> Hi, introduce yourself


>>> Hi, introduce yourself


>>> Hi
Below is a **step‑by‑step explanation** of why each element in the process is needed and what purpose it serves.
The format is *“Step N – What is done → Why it is necessary”* so you can see the reasoning behind every action.

---

### Step 1 – Identify the Goal
**What is done:** Clearly define what you want to achieve (e.g., solve a problem, build a feature, answer a question).
**Why it is necessary:** Without a concrete goal you can’t decide which actions are relevant, you may waste effort on unrelated work, and you won’t have a way to measure success.

### Step 2 – Gather Requirements / Constraints
**What is done:** Collect all functional and non‑functional requirements, inputs, outputs, performance limits, security policies, etc.
**Why it is necessary:** Requirements tell you **what** the solution must do and **how** it must behave. Ignoring them leads to a solution that either fails to meet user needs or breaks system rules.

### Step 3 – Perform a High‑Level Analysis
**What is done:** Break the problem into major components or phases (e.g., input processing, core logic, output generation).
**Why it is necessary:** A high‑level view helps you see the overall structure, spot dependencies early, and avoid “tunnel vision” on a single part while forgetting the rest.

### Step 4 – Choose an Architecture / Design Pattern
**What is done:** Select a suitable architectural style (layered, micro‑services, event‑driven, etc.) or design pattern (Factory, Observer, MVC, etc.).
**Why it is necessary:** Good architecture ensures scalability, maintainability, and testability. A pattern provides a proven solution to a recurring problem, reducing the chance of hidden bugs.

### Step 5 – Create Detailed Design (Algorithms & Data Structures)
**What is done:** Specify exact algorithms, data structures, class diagrams, API contracts, and interaction flows.
**Why it is necessary:** Details turn abstract ideas into implementable code. Choosing the right algorithm/data structure can drastically affect performance and memory usage.

### Step 6 – Validate the Design (Peer Review / Walkthrough)
**What is done:** Have teammates review the design, run design‑by‑example scenarios, or create quick prototypes.
**Why it is necessary:** Early feedback catches logical errors, ambiguous specifications, and unrealistic assumptions before any code is written, saving time later.

### Step 7 – Set Up the Development Environment
**What is done:** Install compilers, libraries, CI pipelines, version‑control hooks, and configure build scripts.
**Why it is necessary:** A consistent environment prevents “works on my machine” problems, enables automated testing, and speeds up iteration.

### Step 8 – Implement the Solution Incrementally
**What is done:** Write code in small, testable units (functions, classes, modules) and commit frequently.
**Why it is necessary:** Small increments make it easier to locate defects, allow continuous integration, and let you get early feedback from tests or users.

### Step 9 – Write Automated Tests (Unit / Integration / Acceptance)
**What is done:** Create tests that verify each component works in isolation and together as a system.
**Why it is necessary:** Tests provide a safety net against regressions, document expected behavior, and give confidence when refactoring or extending the code.

### Step 10 – Run Static Analysis / Code Review
**What is done:** Use linters, type checkers, security scanners, and have peers review the code.
**Why it is necessary:** Static checks catch style violations, potential bugs, and security issues that tests might miss, improving overall code quality.

### Step 11 – Perform Performance & Load Testing
**What is done:** Measure latency, throughput, memory consumption under realistic workloads.
**Why it is necessary:** Even a functionally correct system can fail in production if it can’t handle the expected load or meets performance SLAs.

### Step 12 – Deploy to a Staging Environment
**What is done:** Push the build to an environment that mirrors production (same configuration, databases, network).
**Why it is necessary:** Staging validates that deployment scripts, environment variables, and external integrations work correctly before affecting real users.

### Step 13 – Conduct Acceptance Testing / User Validation
**What is done:** Let stakeholders or a beta‑user group interact with the system and confirm it meets their needs.
**Why it is necessary:** Stakeholder sign‑off ensures the product solves the real problem; hidden usability issues are often discovered only at this stage.

### Step 14 – Deploy to Production
**What is done:** Release the vetted code to the live environment, often using a rollout strategy (blue‑green, canary, rolling).
**Why it is necessary:** A controlled rollout minimizes risk; if something goes wrong you can quickly revert or isolate the issue.

### Step 15 – Monitor & Observe in Production
**What is done:** Collect logs, metrics, traces, and alerts to track health and performance.
**Why it is necessary:** Monitoring detects anomalies early, enabling rapid response before users are impacted; it also provides data for future improvements.

### Step 16 – Gather Feedback & Iterate
**What is done:** Analyze user feedback, error reports, and performance data; plan enhancements or bug fixes.
**Why it is necessary:** Software is never truly finished; continuous improvement keeps the solution relevant, secure, and efficient over time.

---

#### Summary

| # | Step | Core Reason |
|---|------|--------------|
| 1 | Define Goal | Gives direction & success criteria |
| 2 | Gather Requirements | Ensures solution meets real needs |
| 3 | High‑Level Analysis | Reveals overall structure & dependencies |
| 4 | Choose Architecture | Guarantees scalability & maintainability |
| 5 | Detailed Design | Turns concepts into concrete, efficient code |
| 6 | Validate Design | Catches mistakes before coding |
| 7 | Set Up Env. | Prevents environment‑related bugs |
| 8 | Incremental Implementation | Makes debugging & integration easier |
| 9 | Automated Tests | Protects against regressions |
|10| Static Analysis / Review | Improves code quality & security |
|11| Performance Testing | Verifies the system can handle load |
|12| Staging Deployment | Confirms deployment process works |
|13| Acceptance Testing | Confirms it solves the real problem |
|14| Production Deploy | Delivers value to end users |
|15| Monitoring | Detects issues early in live use |
|16| Feedback Loop | Drives continuous improvement |

Following these steps, each **necessary** because it either prevents future problems, validates correctness, or ensures the product delivers real value, leads to a robust, maintainable, and user‑focused solution.

>>>

But it works in Open-WebUI v0.6.22 (latest). There is still some formatting mess, especially with the thinking block, but that could be an Open WebUI problem rather than Ollama's output.

(screenshot of the Open-WebUI session attached)
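To take both clients out of the picture, the raw server response can be checked against the same endpoint the GIN logs show; a minimal sketch (the model name and prompt are the ones from this thread, the rest is standard Ollama API usage):

```shell
# Hedged sketch: query /api/chat directly, bypassing both the CLI and Open-WebUI.
curl -s http://localhost:11434/api/chat -d '{
  "model": "gpt-oss:120b",
  "messages": [{"role": "user", "content": "Hi, introduce yourself"}],
  "stream": false
}'
```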
[GIN] 2025/08/21 - 09:14:15 | 200 | 982.593µs | 192.168.127.20 | GET "/api/tags" Aug 21 09:14:15 ollama[6191]: [GIN] 2025/08/21 - 09:14:15 | 200 | 12.784µs | 192.168.127.20 | GET "/api/ps" Aug 21 09:14:36 ollama[6191]: time=2025-08-21T09:14:36.684+03:00 level=INFO source=server.go:211 msg="enabling flash attention" Aug 21 09:14:36 ollama[6191]: time=2025-08-21T09:14:36.684+03:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type="" Aug 21 09:14:36 ollama[6191]: time=2025-08-21T09:14:36.684+03:00 level=INFO source=server.go:383 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 34605" Aug 21 09:14:36 ollama[6191]: time=2025-08-21T09:14:36.691+03:00 level=INFO source=runner.go:1006 msg="starting ollama engine" Aug 21 09:14:36 ollama[6191]: time=2025-08-21T09:14:36.692+03:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:34605" Aug 21 09:14:37 ollama[6191]: time=2025-08-21T09:14:37.003+03:00 level=INFO source=server.go:488 msg="system memory" total="184.1 GiB" free="174.9 GiB" free_swap="8.0 GiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.293+03:00 level=INFO source=server.go:531 msg=offload library=cuda layers.requested=38 layers.model=37 layers.offload=17 layers.split="[4 4 9]" memory.available="[23.0 GiB 23.1 GiB 31.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="109.9 GiB" memory.required.partial="76.5 GiB" memory.required.kv="2.7 GiB" memory.required.allocations="[22.6 GiB 22.6 GiB 31.2 GiB]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="13.7 GiB" memory.graph.partial="13.7 GiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.294+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:9(0..8) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:9(9..17) ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:19(18..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.334+03:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30 Aug 21 09:14:38 ollama[6191]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no Aug 21 09:14:38 ollama[6191]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no Aug 21 09:14:38 ollama[6191]: ggml_cuda_init: found 3 CUDA devices: Aug 21 09:14:38 ollama[6191]: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Aug 21 09:14:38 ollama[6191]: Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Aug 21 09:14:38 ollama[6191]: Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Aug 21 09:14:38 ollama[6191]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so Aug 21 09:14:38 ollama[6191]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.521+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 
CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.932+03:00 level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.932+03:00 level=INFO source=ggml.go:492 msg="offloading output layer to GPU" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.932+03:00 level=INFO source=ggml.go:497 msg="offloaded 37/37 layers to GPU" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="14.7 GiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="14.7 GiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="30.4 GiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="631.0 MiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="768.5 MiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="1.4 GiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="187.0 MiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="179.5 MiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="179.5 MiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:342 msg="total memory" size="64.1 GiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1 Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=server.go:1234 msg="waiting for llama runner to start responding" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server loading model" Aug 21 09:14:58 ollama[6191]: time=2025-08-21T09:14:58.231+03:00 level=INFO source=server.go:1272 msg="llama runner started in 21.55 seconds" Aug 21 09:15:06 ollama[6191]: [GIN] 2025/08/21 - 09:15:06 | 200 | 33.234048006s | 192.168.127.20 | POST "/api/chat" Aug 21 09:15:11 ollama[6191]: [GIN] 2025/08/21 - 09:15:11 | 200 | 5.62881336s | 192.168.127.20 | POST "/api/chat" ``` It seems to have nothing to do with OLLAMA_NEW_ESTIMATES (at least in ollama console chat). 
I apologize.

```
ollama run gpt-oss:120b
>>> Hi, introduce yourself
>>> Hi, introduce yourself
>>> Hi, introduce yourself
>>> Hi
Below is a **step‑by‑step explanation** of why each element in the process is needed and what purpose it serves. The format is *“Step N – What is done → Why it is necessary”* so you can see the reasoning behind every action.

---

### Step 1 – Identify the Goal
**What is done:** Clearly define what you want to achieve (e.g., solve a problem, build a feature, answer a question).
**Why it is necessary:** Without a concrete goal you can’t decide which actions are relevant, you may waste effort on unrelated work, and you won’t have a way to measure success.

### Step 2 – Gather Requirements / Constraints
**What is done:** Collect all functional and non‑functional requirements, inputs, outputs, performance limits, security policies, etc.
**Why it is necessary:** Requirements tell you **what** the solution must do and **how** it must behave. Ignoring them leads to a solution that either fails to meet user needs or breaks system rules.

### Step 3 – Perform a High‑Level Analysis
**What is done:** Break the problem into major components or phases (e.g., input processing, core logic, output generation).
**Why it is necessary:** A high‑level view helps you see the overall structure, spot dependencies early, and avoid “tunnel vision” on a single part while forgetting the rest.

### Step 4 – Choose an Architecture / Design Pattern
**What is done:** Select a suitable architectural style (layered, micro‑services, event‑driven, etc.) or design pattern (Factory, Observer, MVC, etc.).
**Why it is necessary:** Good architecture ensures scalability, maintainability, and testability. A pattern provides a proven solution to a recurring problem, reducing the chance of hidden bugs.

### Step 5 – Create Detailed Design (Algorithms & Data Structures)
**What is done:** Specify exact algorithms, data structures, class diagrams, API contracts, and interaction flows.
**Why it is necessary:** Details turn abstract ideas into implementable code. Choosing the right algorithm/data structure can drastically affect performance and memory usage.

### Step 6 – Validate the Design (Peer Review / Walkthrough)
**What is done:** Have teammates review the design, run design‑by‑example scenarios, or create quick prototypes.
**Why it is necessary:** Early feedback catches logical errors, ambiguous specifications, and unrealistic assumptions before any code is written, saving time later.

### Step 7 – Set Up the Development Environment
**What is done:** Install compilers, libraries, CI pipelines, version‑control hooks, and configure build scripts.
**Why it is necessary:** A consistent environment prevents “works on my machine” problems, enables automated testing, and speeds up iteration.

### Step 8 – Implement the Solution Incrementally
**What is done:** Write code in small, testable units (functions, classes, modules) and commit frequently.
**Why it is necessary:** Small increments make it easier to locate defects, allow continuous integration, and let you get early feedback from tests or users.

### Step 9 – Write Automated Tests (Unit / Integration / Acceptance)
**What is done:** Create tests that verify each component works in isolation and together as a system.
**Why it is necessary:** Tests provide a safety net against regressions, document expected behavior, and give confidence when refactoring or extending the code.

### Step 10 – Run Static Analysis / Code Review
**What is done:** Use linters, type checkers, security scanners, and have peers review the code.
**Why it is necessary:** Static checks catch style violations, potential bugs, and security issues that tests might miss, improving overall code quality.

### Step 11 – Perform Performance & Load Testing
**What is done:** Measure latency, throughput, memory consumption under realistic workloads.
**Why it is necessary:** Even a functionally correct system can fail in production if it can’t handle the expected load or meets performance SLAs.

### Step 12 – Deploy to a Staging Environment
**What is done:** Push the build to an environment that mirrors production (same configuration, databases, network).
**Why it is necessary:** Staging validates that deployment scripts, environment variables, and external integrations work correctly before affecting real users.

### Step 13 – Conduct Acceptance Testing / User Validation
**What is done:** Let stakeholders or a beta‑user group interact with the system and confirm it meets their needs.
**Why it is necessary:** Stakeholder sign‑off ensures the product solves the real problem; hidden usability issues are often discovered only at this stage.

### Step 14 – Deploy to Production
**What is done:** Release the vetted code to the live environment, often using a rollout strategy (blue‑green, canary, rolling).
**Why it is necessary:** A controlled rollout minimizes risk; if something goes wrong you can quickly revert or isolate the issue.

### Step 15 – Monitor & Observe in Production
**What is done:** Collect logs, metrics, traces, and alerts to track health and performance.
**Why it is necessary:** Monitoring detects anomalies early, enabling rapid response before users are impacted; it also provides data for future improvements.

### Step 16 – Gather Feedback & Iterate
**What is done:** Analyze user feedback, error reports, and performance data; plan enhancements or bug fixes.
**Why it is necessary:** Software is never truly finished; continuous improvement keeps the solution relevant, secure, and efficient over time.

---

#### Summary

| # | Step | Core Reason |
|---|------|--------------|
| 1 | Define Goal | Gives direction & success criteria |
| 2 | Gather Requirements | Ensures solution meets real needs |
| 3 | High‑Level Analysis | Reveals overall structure & dependencies |
| 4 | Choose Architecture | Guarantees scalability & maintainability |
| 5 | Detailed Design | Turns concepts into concrete, efficient code |
| 6 | Validate Design | Catches mistakes before coding |
| 7 | Set Up Env. | Prevents environment‑related bugs |
| 8 | Incremental Implementation | Makes debugging & integration easier |
| 9 | Automated Tests | Protects against regressions |
|10| Static Analysis / Review | Improves code quality & security |
|11| Performance Testing | Verifies the system can handle load |
|12| Staging Deployment | Confirms deployment process works |
|13| Acceptance Testing | Confirms it solves the real problem |
|14| Production Deploy | Delivers value to end users |
|15| Monitoring | Detects issues early in live use |
|16| Feedback Loop | Drives continuous improvement |

Following these steps, each **necessary** because it either prevents future problems, validates correctness, or ensures the product delivers real value, leads to a robust, maintainable, and user‑focused solution.

>>>
```

But it works in Open-WebUI v0.6.22 (latest).
Although there is still some formatting mess, especially with the thinking block, that could be an Open WebUI problem rather than Ollama's output.

<img width="1027" height="526" alt="Image" src="https://github.com/user-attachments/assets/ce388746-49a1-4ee0-9e21-8d2405bb8e1a" />

<!-- gh-comment-id:3211512447 --> @jessegross commented on GitHub (Aug 21, 2025):

Thanks for the update that it is not related to OLLAMA_NEW_ESTIMATES. Possibly Open WebUI is calling Ollama with different settings that avoid the problem. Can you post the log with Open WebUI when it works fine?

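One concrete "different setting" is already visible in the logs: the console-chat load on Aug 21 ran with KvSize:8192, while the Open-WebUI loads ran with KvSize:75000, i.e. the web client requested a much larger context window. A minimal sketch for isolating that variable (hypothetical values; assumes only the standard /api/chat options field):

```sh
# Replay the same prompt with an explicit per-request context size, taking
# client-side defaults out of the picture. If the broken output tracks
# num_ctx rather than the env flag, the layer-split / KV-sizing path is a
# better suspect than OLLAMA_NEW_ESTIMATES itself.
curl -s http://localhost:11434/api/chat -d '{
  "model": "gpt-oss:120b",
  "messages": [{"role": "user", "content": "Hi, introduce yourself"}],
  "options": {"num_ctx": 75000},
  "stream": false
}'
```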

<!-- gh-comment-id:3213347666 --> @ka-admin commented on GitHub (Aug 22, 2025):

sure
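
(For reference, the flag was flipped between the two runs below; on a systemd install the toggle is roughly the following — a sketch that assumes the stock ollama.service unit and a drop-in override:)

```sh
# Sketch: create or edit a systemd drop-in for the ollama service.
sudo systemctl edit ollama
# In the editor, set the flag for this run:
#   [Service]
#   Environment="OLLAMA_NEW_ESTIMATES=0"   # or "1" for the second run
sudo systemctl restart ollama
# Then follow the logs exactly as below:
journalctl -u ollama --no-pager --follow --pager-end
```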

**With OLLAMA_NEW_ESTIMATES = 0**


journalctl -u ollama --no-pager --follow --pager-end
Aug 22 09:39:04 systemd[1]: Started ollama.service - Ollama Service.
Aug 22 09:39:04 ollama[1983]: time=2025-08-22T09:39:04.089+03:00 level=INFO source=routes.go:1318 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NEW_ESTIMATES:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Aug 22 09:39:04 ollama[1983]: time=2025-08-22T09:39:04.107+03:00 level=INFO source=images.go:477 msg="total blobs: 68"
Aug 22 09:39:04 ollama[1983]: time=2025-08-22T09:39:04.107+03:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
Aug 22 09:39:04 ollama[1983]: time=2025-08-22T09:39:04.108+03:00 level=INFO source=routes.go:1371 msg="Listening on [::]:11434 (version 0.11.6)"
Aug 22 09:39:04 ollama[1983]: time=2025-08-22T09:39:04.109+03:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Aug 22 09:39:06 ollama[1983]: time=2025-08-22T09:39:06.102+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Aug 22 09:39:06 ollama[1983]: time=2025-08-22T09:39:06.102+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Aug 22 09:39:06 ollama[1983]: time=2025-08-22T09:39:06.102+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda variant=v12 compute=7.0 driver=13.0 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
Aug 22 09:52:29 ollama[1983]: [GIN] 2025/08/22 - 09:52:29 | 200 |   12.525484ms |  192.168.127.20 | GET      "/api/tags"
Aug 22 09:52:29 ollama[1983]: [GIN] 2025/08/22 - 09:52:29 | 200 |      44.403µs |  192.168.127.20 | GET      "/api/ps"
Aug 22 09:52:29 ollama[1983]: [GIN] 2025/08/22 - 09:52:29 | 200 |      50.274µs |  192.168.127.20 | GET      "/api/version"
Aug 22 10:04:53 ollama[1983]: [GIN] 2025/08/22 - 10:04:53 | 200 |    1.057968ms |  192.168.127.20 | GET      "/api/tags"
Aug 22 10:04:53 ollama[1983]: [GIN] 2025/08/22 - 10:04:53 | 200 |       8.275µs |  192.168.127.20 | GET      "/api/ps"
Aug 22 10:04:53 ollama[1983]: [GIN] 2025/08/22 - 10:04:53 | 200 |      32.922µs |  192.168.127.20 | GET      "/api/version"
Aug 22 10:05:00 ollama[1983]: [GIN] 2025/08/22 - 10:05:00 | 200 |    1.062437ms |  192.168.127.20 | GET      "/api/tags"
Aug 22 10:05:00 ollama[1983]: [GIN] 2025/08/22 - 10:05:00 | 200 |       9.428µs |  192.168.127.20 | GET      "/api/ps"
Aug 22 10:05:02 ollama[1983]: [GIN] 2025/08/22 - 10:05:02 | 200 |    1.186741ms |  192.168.127.20 | GET      "/api/tags"
Aug 22 10:05:02 ollama[1983]: [GIN] 2025/08/22 - 10:05:02 | 200 |      13.305µs |  192.168.127.20 | GET      "/api/ps"
Aug 22 10:05:08 ollama[1983]: [GIN] 2025/08/22 - 10:05:08 | 200 |     1.47887ms |  192.168.127.20 | GET      "/api/tags"
Aug 22 10:05:08 ollama[1983]: [GIN] 2025/08/22 - 10:05:08 | 200 |       8.576µs |  192.168.127.20 | GET      "/api/ps"
Aug 22 10:05:10 ollama[1983]: [GIN] 2025/08/22 - 10:05:10 | 200 |      33.072µs |  192.168.127.20 | GET      "/api/version"
Aug 22 10:06:48 ollama[1983]: [GIN] 2025/08/22 - 10:06:48 | 200 |      36.148µs |  192.168.127.20 | GET      "/api/version"
Aug 22 10:07:05 ollama[1983]: time=2025-08-22T10:07:05.703+03:00 level=INFO source=server.go:211 msg="enabling flash attention"
Aug 22 10:07:05 ollama[1983]: time=2025-08-22T10:07:05.710+03:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
Aug 22 10:07:05 ollama[1983]: time=2025-08-22T10:07:05.710+03:00 level=INFO source=server.go:383 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 43039"
Aug 22 10:07:05 ollama[1983]: time=2025-08-22T10:07:05.719+03:00 level=INFO source=runner.go:1006 msg="starting ollama engine"
Aug 22 10:07:05 ollama[1983]: time=2025-08-22T10:07:05.719+03:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:43039"
Aug 22 10:07:06 ollama[1983]: time=2025-08-22T10:07:06.039+03:00 level=INFO source=server.go:488 msg="system memory" total="184.1 GiB" free="175.5 GiB" free_swap="8.0 GiB"
Aug 22 10:07:07 ollama[1983]: time=2025-08-22T10:07:07.378+03:00 level=INFO source=server.go:531 msg=offload library=cuda layers.requested=38 layers.model=37 layers.offload=17 layers.split="[4 4 9]" memory.available="[23.1 GiB 23.1 GiB 31.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="109.9 GiB" memory.required.partial="76.5 GiB" memory.required.kv="2.7 GiB" memory.required.allocations="[22.6 GiB 22.6 GiB 31.2 GiB]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="13.7 GiB" memory.graph.partial="13.7 GiB"
Aug 22 10:07:07 ollama[1983]: time=2025-08-22T10:07:07.387+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:9(0..8) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:9(9..17) ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:19(18..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 22 10:07:07 ollama[1983]: time=2025-08-22T10:07:07.425+03:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
Aug 22 10:07:07 ollama[1983]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Aug 22 10:07:07 ollama[1983]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Aug 22 10:07:07 ollama[1983]: ggml_cuda_init: found 3 CUDA devices:
Aug 22 10:07:07 ollama[1983]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Aug 22 10:07:07 ollama[1983]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Aug 22 10:07:07 ollama[1983]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Aug 22 10:07:07 ollama[1983]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
Aug 22 10:07:07 ollama[1983]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Aug 22 10:07:07 ollama[1983]: time=2025-08-22T10:07:07.735+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=ggml.go:497 msg="offloaded 37/37 layers to GPU"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="14.7 GiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="14.7 GiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="30.4 GiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="631.0 MiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="768.5 MiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="1.4 GiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="187.0 MiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="179.5 MiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="179.5 MiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:342 msg="total memory" size="64.1 GiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=server.go:1234 msg="waiting for llama runner to start responding"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server loading model"
Aug 22 10:07:31 ollama[1983]: time=2025-08-22T10:07:31.713+03:00 level=INFO source=server.go:1272 msg="llama runner started in 26.00 seconds"
Aug 22 10:07:36 ollama[1983]: [GIN] 2025/08/22 - 10:07:36 | 200 | 31.886988523s |  192.168.127.20 | POST     "/api/chat"
Aug 22 10:07:39 ollama[1983]: [GIN] 2025/08/22 - 10:07:39 | 200 |  2.854969906s |  192.168.127.20 | POST     "/api/chat"
<img width="1004" height="322" alt="Image" src="https://github.com/user-attachments/assets/4149ae3d-53f2-41ce-ad9a-a13de2de93f0" />

PS Notice the model thought block is inserted without <think> tags

**With OLLAMA_NEW_ESTIMATES = 1**

Started ollama.service - Ollama Service.
Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.283+03:00 level=INFO source=routes.go:1318 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NEW_ESTIMATES:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.285+03:00 level=INFO source=images.go:477 msg="total blobs: 68"
Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.285+03:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.286+03:00 level=INFO source=routes.go:1371 msg="Listening on [::]:11434 (version 0.11.6)"
Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.286+03:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.728+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.728+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.728+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda variant=v12 compute=7.0 driver=13.0 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
Aug 22 10:11:06 ollama[8313]: time=2025-08-22T10:11:06.883+03:00 level=INFO source=server.go:166 msg="enabling new memory estimates"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.222+03:00 level=INFO source=server.go:211 msg="enabling flash attention"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.222+03:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.223+03:00 level=INFO source=server.go:383 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 42905"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.223+03:00 level=INFO source=server.go:659 msg="loading model" "model layers"=37 requested=38
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.230+03:00 level=INFO source=runner.go:1006 msg="starting ollama engine"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.231+03:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:42905"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.557+03:00 level=INFO source=server.go:665 msg="system memory" total="184.1 GiB" free="175.5 GiB" free_swap="8.0 GiB"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.557+03:00 level=INFO source=server.go:669 msg="gpu memory" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.557+03:00 level=INFO source=server.go:669 msg="gpu memory" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.557+03:00 level=INFO source=server.go:669 msg="gpu memory" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 available="31.0 GiB" free="31.4 GiB" minimum="457.0 MiB" overhead="0 B"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.557+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.591+03:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
Aug 22 10:11:07 ollama[8313]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Aug 22 10:11:07 ollama[8313]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Aug 22 10:11:07 ollama[8313]: ggml_cuda_init: found 3 CUDA devices:
Aug 22 10:11:07 ollama[8313]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Aug 22 10:11:07 ollama[8313]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Aug 22 10:11:07 ollama[8313]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Aug 22 10:11:07 ollama[8313]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
Aug 22 10:11:07 ollama[8313]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.768+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.874+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.910+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=ggml.go:497 msg="offloaded 37/37 layers to GPU"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="17.4 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="17.9 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="24.5 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="777.5 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="924.0 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="1.1 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="179.5 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="179.5 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="187.0 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:342 msg="total memory" size="64.1 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=server.go:1234 msg="waiting for llama runner to start responding"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server loading model"
Aug 22 10:11:24 ollama[8313]: time=2025-08-22T10:11:24.551+03:00 level=INFO source=server.go:1272 msg="llama runner started in 17.33 seconds"
Aug 22 10:11:36 ollama[8313]: [GIN] 2025/08/22 - 10:11:36 | 200 | 30.393176227s |  192.168.127.20 | POST     "/api/chat"
Aug 22 10:12:31 ollama[8313]: [GIN] 2025/08/22 - 10:12:31 | 200 | 55.275684061s |  192.168.127.20 | POST     "/api/chat"
<img width="990" height="540" alt="Image" src="https://github.com/user-attachments/assets/e1b233f5-666e-41cf-83a9-1770c9ad63dc" />
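
Regarding the PS above about the thought block appearing inline without <think> tags: a hedged sketch of how a client can ask for the reasoning in a separate field instead (assumes a server new enough to support the "think" request field; for gpt-oss some versions expect a level such as "low"/"medium"/"high" rather than a boolean):

```sh
# When the "think" field is honored, reasoning is returned in
# message.thinking instead of being inlined into message.content,
# so clients like Open WebUI can render it as a proper thinking block.
curl -s http://localhost:11434/api/chat -d '{
  "model": "gpt-oss:120b",
  "messages": [{"role": "user", "content": "Hi, introduce yourself"}],
  "think": true,
  "stream": false
}'
```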
<!-- gh-comment-id:3213347666 --> @ka-admin commented on GitHub (Aug 22, 2025): sure **With OLLAMA_NEW_ESTIMATES = 0** ``` journalctl -u ollama --no-pager --follow --pager-end Aug 22 09:39:04 systemd[1]: Started ollama.service - Ollama Service. Aug 22 09:39:04 ollama[1983]: time=2025-08-22T09:39:04.089+03:00 level=INFO source=routes.go:1318 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NEW_ESTIMATES:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]" Aug 22 09:39:04 ollama[1983]: time=2025-08-22T09:39:04.107+03:00 level=INFO source=images.go:477 msg="total blobs: 68" Aug 22 09:39:04 ollama[1983]: time=2025-08-22T09:39:04.107+03:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0" Aug 22 09:39:04 ollama[1983]: time=2025-08-22T09:39:04.108+03:00 level=INFO source=routes.go:1371 msg="Listening on [::]:11434 (version 0.11.6)" Aug 22 09:39:04 ollama[1983]: time=2025-08-22T09:39:04.109+03:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs" Aug 22 09:39:06 ollama[1983]: time=2025-08-22T09:39:06.102+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB" Aug 22 09:39:06 ollama[1983]: time=2025-08-22T09:39:06.102+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB" Aug 22 09:39:06 ollama[1983]: time=2025-08-22T09:39:06.102+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda variant=v12 compute=7.0 driver=13.0 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB" Aug 22 09:52:29 ollama[1983]: [GIN] 2025/08/22 - 09:52:29 | 200 | 12.525484ms | 192.168.127.20 | GET "/api/tags" Aug 22 09:52:29 ollama[1983]: [GIN] 2025/08/22 - 09:52:29 | 200 | 44.403µs | 192.168.127.20 | GET "/api/ps" Aug 22 09:52:29 ollama[1983]: [GIN] 2025/08/22 - 09:52:29 | 200 | 50.274µs | 192.168.127.20 | GET "/api/version" Aug 22 10:04:53 ollama[1983]: [GIN] 2025/08/22 - 10:04:53 | 200 | 1.057968ms | 192.168.127.20 | GET "/api/tags" Aug 22 10:04:53 ollama[1983]: [GIN] 2025/08/22 - 10:04:53 | 200 | 8.275µs | 192.168.127.20 | GET "/api/ps" Aug 22 10:04:53 ollama[1983]: [GIN] 2025/08/22 - 10:04:53 | 200 | 32.922µs | 192.168.127.20 | GET "/api/version" Aug 22 10:05:00 ollama[1983]: [GIN] 2025/08/22 - 10:05:00 | 200 | 1.062437ms | 192.168.127.20 | GET "/api/tags" Aug 22 10:05:00 ollama[1983]: [GIN] 2025/08/22 - 10:05:00 | 200 | 
9.428µs | 192.168.127.20 | GET "/api/ps" Aug 22 10:05:02 ollama[1983]: [GIN] 2025/08/22 - 10:05:02 | 200 | 1.186741ms | 192.168.127.20 | GET "/api/tags" Aug 22 10:05:02 ollama[1983]: [GIN] 2025/08/22 - 10:05:02 | 200 | 13.305µs | 192.168.127.20 | GET "/api/ps" Aug 22 10:05:08 ollama[1983]: [GIN] 2025/08/22 - 10:05:08 | 200 | 1.47887ms | 192.168.127.20 | GET "/api/tags" Aug 22 10:05:08 ollama[1983]: [GIN] 2025/08/22 - 10:05:08 | 200 | 8.576µs | 192.168.127.20 | GET "/api/ps" Aug 22 10:05:10 ollama[1983]: [GIN] 2025/08/22 - 10:05:10 | 200 | 33.072µs | 192.168.127.20 | GET "/api/version" Aug 22 10:06:48 ollama[1983]: [GIN] 2025/08/22 - 10:06:48 | 200 | 36.148µs | 192.168.127.20 | GET "/api/version" Aug 22 10:07:05 ollama[1983]: time=2025-08-22T10:07:05.703+03:00 level=INFO source=server.go:211 msg="enabling flash attention" Aug 22 10:07:05 ollama[1983]: time=2025-08-22T10:07:05.710+03:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type="" Aug 22 10:07:05 ollama[1983]: time=2025-08-22T10:07:05.710+03:00 level=INFO source=server.go:383 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 43039" Aug 22 10:07:05 ollama[1983]: time=2025-08-22T10:07:05.719+03:00 level=INFO source=runner.go:1006 msg="starting ollama engine" Aug 22 10:07:05 ollama[1983]: time=2025-08-22T10:07:05.719+03:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:43039" Aug 22 10:07:06 ollama[1983]: time=2025-08-22T10:07:06.039+03:00 level=INFO source=server.go:488 msg="system memory" total="184.1 GiB" free="175.5 GiB" free_swap="8.0 GiB" Aug 22 10:07:07 ollama[1983]: time=2025-08-22T10:07:07.378+03:00 level=INFO source=server.go:531 msg=offload library=cuda layers.requested=38 layers.model=37 layers.offload=17 layers.split="[4 4 9]" memory.available="[23.1 GiB 23.1 GiB 31.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="109.9 GiB" memory.required.partial="76.5 GiB" memory.required.kv="2.7 GiB" memory.required.allocations="[22.6 GiB 22.6 GiB 31.2 GiB]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="13.7 GiB" memory.graph.partial="13.7 GiB" Aug 22 10:07:07 ollama[1983]: time=2025-08-22T10:07:07.387+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:9(0..8) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:9(9..17) ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:19(18..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" Aug 22 10:07:07 ollama[1983]: time=2025-08-22T10:07:07.425+03:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30 Aug 22 10:07:07 ollama[1983]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no Aug 22 10:07:07 ollama[1983]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no Aug 22 10:07:07 ollama[1983]: ggml_cuda_init: found 3 CUDA devices: Aug 22 10:07:07 ollama[1983]: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Aug 22 10:07:07 ollama[1983]: Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Aug 22 10:07:07 ollama[1983]: Device 2: Tesla V100-SXM2-32GB, compute 
capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Aug 22 10:07:07 ollama[1983]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so Aug 22 10:07:07 ollama[1983]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so Aug 22 10:07:07 ollama[1983]: time=2025-08-22T10:07:07.735+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=ggml.go:492 msg="offloading output layer to GPU" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=ggml.go:497 msg="offloaded 37/37 layers to GPU" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="14.7 GiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="14.7 GiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="30.4 GiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="631.0 MiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="768.5 MiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="1.4 GiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="187.0 MiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="179.5 MiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="179.5 MiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:342 msg="total memory" size="64.1 GiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1 Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=server.go:1234 msg="waiting for llama runner to start responding" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server loading model" Aug 22 10:07:31 ollama[1983]: 
time=2025-08-22T10:07:31.713+03:00 level=INFO source=server.go:1272 msg="llama runner started in 26.00 seconds" Aug 22 10:07:36 ollama[1983]: [GIN] 2025/08/22 - 10:07:36 | 200 | 31.886988523s | 192.168.127.20 | POST "/api/chat" Aug 22 10:07:39 ollama[1983]: [GIN] 2025/08/22 - 10:07:39 | 200 | 2.854969906s | 192.168.127.20 | POST "/api/chat" ``` <img width="1004" height="322" alt="Image" src="https://github.com/user-attachments/assets/4149ae3d-53f2-41ce-ad9a-a13de2de93f0" /> PS Notice model thought block inserted without <think> tags **With OLLAMA_NEW_ESTIMATES = 1** ``` Started ollama.service - Ollama Service. Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.283+03:00 level=INFO source=routes.go:1318 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NEW_ESTIMATES:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]" Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.285+03:00 level=INFO source=images.go:477 msg="total blobs: 68" Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.285+03:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0" Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.286+03:00 level=INFO source=routes.go:1371 msg="Listening on [::]:11434 (version 0.11.6)" Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.286+03:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs" Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.728+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB" Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.728+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB" Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.728+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda variant=v12 compute=7.0 driver=13.0 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB" Aug 22 10:11:06 ollama[8313]: time=2025-08-22T10:11:06.883+03:00 level=INFO source=server.go:166 msg="enabling new memory estimates" Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.222+03:00 level=INFO source=server.go:211 msg="enabling flash attention" Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.222+03:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type="" Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.223+03:00 
level=INFO source=server.go:383 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 42905"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.223+03:00 level=INFO source=server.go:659 msg="loading model" "model layers"=37 requested=38
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.230+03:00 level=INFO source=runner.go:1006 msg="starting ollama engine"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.231+03:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:42905"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.557+03:00 level=INFO source=server.go:665 msg="system memory" total="184.1 GiB" free="175.5 GiB" free_swap="8.0 GiB"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.557+03:00 level=INFO source=server.go:669 msg="gpu memory" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.557+03:00 level=INFO source=server.go:669 msg="gpu memory" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.557+03:00 level=INFO source=server.go:669 msg="gpu memory" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 available="31.0 GiB" free="31.4 GiB" minimum="457.0 MiB" overhead="0 B"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.557+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.591+03:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
Aug 22 10:11:07 ollama[8313]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Aug 22 10:11:07 ollama[8313]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Aug 22 10:11:07 ollama[8313]: ggml_cuda_init: found 3 CUDA devices:
Aug 22 10:11:07 ollama[8313]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Aug 22 10:11:07 ollama[8313]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Aug 22 10:11:07 ollama[8313]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Aug 22 10:11:07 ollama[8313]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
Aug 22 10:11:07 ollama[8313]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.768+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.874+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.910+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=ggml.go:497 msg="offloaded 37/37 layers to GPU"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="17.4 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="17.9 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="24.5 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="777.5 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="924.0 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="1.1 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="179.5 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="179.5 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="187.0 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:342 msg="total memory" size="64.1 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=server.go:1234 msg="waiting for llama runner to start responding"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server loading model"
Aug 22 10:11:24 ollama[8313]: time=2025-08-22T10:11:24.551+03:00 level=INFO source=server.go:1272 msg="llama runner started in 17.33 seconds"
Aug 22 10:11:36 ollama[8313]: [GIN] 2025/08/22 - 10:11:36 | 200 | 30.393176227s | 192.168.127.20 | POST "/api/chat"
Aug 22 10:12:31 ollama[8313]: [GIN] 2025/08/22 - 10:12:31 | 200 | 55.275684061s | 192.168.127.20 | POST "/api/chat"
```

<img width="990" height="540" alt="Image" src="https://github.com/user-attachments/assets/e1b233f5-666e-41cf-83a9-1770c9ad63dc" />
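For anyone comparing runs: the load sequence above shows the new estimator making three passes (`Operation:fit`, `alloc`, `commit`) before settling on the 15/11/11 layer split across the V100 and the two 4090s. A quick way to A/B the two code paths is a systemd drop-in that toggles the estimator. This is only a sketch, not a confirmed fix: it assumes the stock `ollama.service` unit and that `OLLAMA_NEW_ESTIMATES` parses as a boolean (so `0`/`false` disables it).

```shell
# Sketch: toggle the new memory estimator off for a comparison run.
# Assumes the stock ollama.service unit; treating "0" as "disabled" is an
# assumption based on the boolean-style values in the server config map.
sudo systemctl edit ollama      # opens a drop-in override in $EDITOR
# In the drop-in, add:
#   [Service]
#   Environment="OLLAMA_NEW_ESTIMATES=0"
sudo systemctl restart ollama

# Confirm which estimator path the server picked up on startup:
journalctl -u ollama --no-pager | grep -E 'OLLAMA_NEW_ESTIMATES|new memory estimates' | tail -n 2
```

With the override in place, the startup config map should report `OLLAMA_NEW_ESTIMATES:false` and the "enabling new memory estimates" line should no longer appear, which makes it easy to attach logs from both configurations to this issue.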