[GH-ISSUE #11982] ollama 0.11.5: gpt-oss:120b produces nonsense output (or no output at all) #7955

Open
opened 2026-04-12 20:07:45 -05:00 by GiteaMirror · 4 comments

Originally created by @ka-admin on GitHub (Aug 20, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11982

What is the issue?

When using ollama 0.11.5 with gpt-oss:120b and OLLAMA_NEW_ESTIMATES=1, the LLM produces nonsense output or no output at all (in Open-WebUI). I noticed this behaviour in the RC builds, and the release still has the issue. Disabling OLLAMA_NEW_ESTIMATES fixes the problem.
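A minimal sketch of the workaround, assuming the stock systemd install visible in the logs below (the drop-in approach is an assumption; any other way of clearing the variable should have the same effect):

```shell
# Hedged sketch: disable the new memory estimates for the systemd service.
sudo systemctl edit ollama
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_NEW_ESTIMATES=0"
# Save, then restart so the runner picks up the change:
sudo systemctl restart ollama
```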

Relevant log output

journalctl -u ollama --no-pager --follow --pager-end
Aug 20 09:32:38 systemd[1]: Started ollama.service - Ollama Service.
Aug 20 09:32:38 ollama[1914]: time=2025-08-20T09:32:38.396+03:00 level=INFO source=routes.go:1318 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NEW_ESTIMATES:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Aug 20 09:32:38 ollama[1914]: time=2025-08-20T09:32:38.416+03:00 level=INFO source=images.go:477 msg="total blobs: 68"
Aug 20 09:32:38 ollama[1914]: time=2025-08-20T09:32:38.416+03:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
Aug 20 09:32:38 ollama[1914]: time=2025-08-20T09:32:38.417+03:00 level=INFO source=routes.go:1371 msg="Listening on [::]:11434 (version 0.11.5-rc5)"
Aug 20 09:32:38 ollama[1914]: time=2025-08-20T09:32:38.417+03:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Aug 20 09:32:40 ollama[1914]: time=2025-08-20T09:32:40.868+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Aug 20 09:32:40 ollama[1914]: time=2025-08-20T09:32:40.868+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Aug 20 09:32:40 ollama[1914]: time=2025-08-20T09:32:40.868+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda variant=v12 compute=7.0 driver=13.0 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
Aug 20 09:35:15 ollama[1914]: [GIN] 2025/08/20 - 09:35:15 | 200 |   12.401326ms |  192.168.127.20 | GET      "/api/tags"
Aug 20 09:35:15 ollama[1914]: [GIN] 2025/08/20 - 09:35:15 | 200 |     669.615µs |  192.168.127.20 | GET      "/api/ps"
Aug 20 09:35:25 ollama[1914]: [GIN] 2025/08/20 - 09:35:25 | 200 |      57.589µs |  192.168.127.20 | GET      "/api/version"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.220+03:00 level=INFO source=server.go:166 msg="enabling new memory estimates"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.563+03:00 level=INFO source=server.go:211 msg="enabling flash attention"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.563+03:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.563+03:00 level=INFO source=server.go:383 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 36793"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.563+03:00 level=INFO source=server.go:659 msg="loading model" "model layers"=37 requested=38
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.577+03:00 level=INFO source=runner.go:1006 msg="starting ollama engine"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.578+03:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:36793"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.897+03:00 level=INFO source=server.go:665 msg="system memory" total="184.1 GiB" free="173.0 GiB" free_swap="8.0 GiB"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.897+03:00 level=INFO source=server.go:669 msg="gpu memory" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.897+03:00 level=INFO source=server.go:669 msg="gpu memory" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.897+03:00 level=INFO source=server.go:669 msg="gpu memory" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 available="31.0 GiB" free="31.4 GiB" minimum="457.0 MiB" overhead="0 B"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.905+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 20 09:36:22 ollama[1914]: time=2025-08-20T09:36:22.942+03:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
Aug 20 09:36:23 ollama[1914]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Aug 20 09:36:23 ollama[1914]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Aug 20 09:36:23 ollama[1914]: ggml_cuda_init: found 3 CUDA devices:
Aug 20 09:36:23 ollama[1914]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Aug 20 09:36:23 ollama[1914]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Aug 20 09:36:23 ollama[1914]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Aug 20 09:36:23 ollama[1914]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
Aug 20 09:36:23 ollama[1914]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.221+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.335+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.374+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.721+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="17.4 GiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="17.9 GiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="24.5 GiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="777.5 MiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="924.0 MiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="1.1 GiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="179.5 MiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="179.5 MiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="187.0 MiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=backend.go:342 msg="total memory" size="64.1 GiB"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=server.go:1234 msg="waiting for llama runner to start responding"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.722+03:00 level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.728+03:00 level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.728+03:00 level=INFO source=ggml.go:497 msg="offloaded 37/37 layers to GPU"
Aug 20 09:36:23 ollama[1914]: time=2025-08-20T09:36:23.729+03:00 level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server loading model"
Aug 20 09:36:41 ollama[1914]: time=2025-08-20T09:36:41.522+03:00 level=INFO source=server.go:1272 msg="llama runner started in 18.96 seconds"
Aug 20 09:36:51 ollama[1914]: [GIN] 2025/08/20 - 09:36:51 | 200 |      53.241µs |  192.168.127.20 | GET      "/api/version"
Aug 20 09:38:05 ollama[1914]: [GIN] 2025/08/20 - 09:38:05 | 200 |         1m43s |  192.168.127.20 | POST     "/api/chat"
Aug 20 09:39:56 ollama[1914]: [GIN] 2025/08/20 - 09:39:56 | 200 |         1m51s |  192.168.127.20 | POST     "/api/chat"



ollama run gpt-oss:120b
>>> Hi, introduce yourself
The sun was low over the river when Jake pushed his boat into the water. He had been at the front for three years, and the only thing he remembered now was the way the current tugged at the oars and the quiet
sound of water against wood.

He rowed out past the old stone bridge, past the fields that had once been green and were now scarred with craters. The wind smelled of wheat and smoke. He thought of Maria, of the night they had kissed under the
low lamp in the village, before the night fell and the bombs began.

He let the oars rest for a moment, feeling the rhythm of the river beneath his hands. The water was calm, but his mind was not. He saw the faces of his comrades, the hollow eyes of men who had watched the world
burn, and the child he had left behind in the hills.

When the sun finally sank behind the hills, Jake turned the boat back toward the shore. He could hear the distant rumble of artillery, but the river carried his thoughts away. He knew that some wounds never close,
and some nights never end, but he also knew that a man could sit in a boat and still feel the pulse of life in the quiet flow of water.

He pulled the boat up onto the sand, stepped out, and walked toward the town, his boots leaving faint marks in the dust. In the distance, a dog barked. He smiled, a small, tired smile, and kept walking, his heart
beating in time with the river he left behind.

>>> Send a message (/? for help)


I'm using Ubuntu Server 25.04 with an AMD Ryzen 9 7950X and 3 GPUs: 2x RTX 4090 + 1x Tesla V100 SXM2 32GB, with 192 GB RAM.

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.11.5

GiteaMirror added the bug label 2026-04-12 20:07:45 -05:00

@jessegross commented on GitHub (Aug 20, 2025):

Can you please post the logs with OLLAMA_NEW_ESTIMATES off?
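For reference, one way to capture those logs, assuming the systemd setup from the issue; running the server in the foreground is an alternative to editing the unit, and OLLAMA_DEBUG=1 is an assumption for more verbose output:

```shell
# Hedged sketch: rerun with the flag off and capture logs directly from stderr.
sudo systemctl stop ollama
OLLAMA_NEW_ESTIMATES=0 OLLAMA_DEBUG=1 ollama serve 2> ollama-no-estimates.log
# ...reproduce the prompt, then attach ollama-no-estimates.log
```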


@ka-admin commented on GitHub (Aug 21, 2025):

 Started ollama.service - Ollama Service.
Aug 21 09:10:25 ollama[6191]: time=2025-08-21T09:10:25.591+03:00 level=INFO source=routes.go:1318 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NEW_ESTIMATES:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Aug 21 09:10:25 ollama[6191]: time=2025-08-21T09:10:25.593+03:00 level=INFO source=images.go:477 msg="total blobs: 68"
Aug 21 09:10:25 ollama[6191]: time=2025-08-21T09:10:25.594+03:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
Aug 21 09:10:25 ollama[6191]: time=2025-08-21T09:10:25.594+03:00 level=INFO source=routes.go:1371 msg="Listening on [::]:11434 (version 0.11.5-rc5)"
Aug 21 09:10:25 ollama[6191]: time=2025-08-21T09:10:25.594+03:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Aug 21 09:10:26 ollama[6191]: time=2025-08-21T09:10:26.079+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Aug 21 09:10:26 ollama[6191]: time=2025-08-21T09:10:26.079+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Aug 21 09:10:26 ollama[6191]: time=2025-08-21T09:10:26.079+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda variant=v12 compute=7.0 driver=13.0 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
Aug 21 09:10:34 ollama[6191]: [GIN] 2025/08/21 - 09:10:34 | 200 |       54.06µs |       127.0.0.1 | HEAD     "/"
Aug 21 09:10:34 ollama[6191]: [GIN] 2025/08/21 - 09:10:34 | 200 |   72.204534ms |       127.0.0.1 | POST     "/api/show"
Aug 21 09:10:35 ollama[6191]: time=2025-08-21T09:10:35.119+03:00 level=INFO source=server.go:211 msg="enabling flash attention"
Aug 21 09:10:35 ollama[6191]: time=2025-08-21T09:10:35.119+03:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
Aug 21 09:10:35 ollama[6191]: time=2025-08-21T09:10:35.119+03:00 level=INFO source=server.go:383 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 45539"
Aug 21 09:10:35 ollama[6191]: time=2025-08-21T09:10:35.127+03:00 level=INFO source=runner.go:1006 msg="starting ollama engine"
Aug 21 09:10:35 ollama[6191]: time=2025-08-21T09:10:35.127+03:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:45539"
Aug 21 09:10:35 ollama[6191]: time=2025-08-21T09:10:35.438+03:00 level=INFO source=server.go:488 msg="system memory" total="184.1 GiB" free="175.8 GiB" free_swap="8.0 GiB"
Aug 21 09:10:36 ollama[6191]: time=2025-08-21T09:10:36.410+03:00 level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 library=cuda parallel=1 required="70.9 GiB" gpus=3
Aug 21 09:10:36 ollama[6191]: time=2025-08-21T09:10:36.729+03:00 level=INFO source=server.go:531 msg=offload library=cuda layers.requested=-1 layers.model=37 layers.offload=37 layers.split="[15 11 11]" memory.available="[31.4 GiB 23.1 GiB 23.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="70.9 GiB" memory.required.partial="70.9 GiB" memory.required.kv="450.0 MiB" memory.required.allocations="[27.6 GiB 21.6 GiB 21.6 GiB]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="1.5 GiB" memory.graph.partial="1.5 GiB"
Aug 21 09:10:36 ollama[6191]: time=2025-08-21T09:10:36.730+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 21 09:10:36 ollama[6191]: time=2025-08-21T09:10:36.764+03:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
Aug 21 09:10:36 ollama[6191]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Aug 21 09:10:36 ollama[6191]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Aug 21 09:10:36 ollama[6191]: ggml_cuda_init: found 3 CUDA devices:
Aug 21 09:10:36 ollama[6191]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Aug 21 09:10:36 ollama[6191]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Aug 21 09:10:36 ollama[6191]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Aug 21 09:10:36 ollama[6191]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
Aug 21 09:10:36 ollama[6191]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Aug 21 09:10:36 ollama[6191]: time=2025-08-21T09:10:36.940+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=ggml.go:497 msg="offloaded 37/37 layers to GPU"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="17.4 GiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="17.9 GiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="24.5 GiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="125.0 MiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="141.0 MiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="184.0 MiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="114.3 MiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="114.3 MiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="121.8 MiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=backend.go:342 msg="total memory" size="61.7 GiB"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.103+03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.104+03:00 level=INFO source=server.go:1234 msg="waiting for llama runner to start responding"
Aug 21 09:10:37 ollama[6191]: time=2025-08-21T09:10:37.104+03:00 level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server loading model"
Aug 21 09:10:53 ollama[6191]: time=2025-08-21T09:10:53.147+03:00 level=INFO source=server.go:1272 msg="llama runner started in 18.03 seconds"
Aug 21 09:10:53 ollama[6191]: [GIN] 2025/08/21 - 09:10:53 | 200 | 18.925510525s |       127.0.0.1 | POST     "/api/generate"
Aug 21 09:11:16 ollama[6191]: [GIN] 2025/08/21 - 09:11:16 | 200 |  551.935404ms |       127.0.0.1 | POST     "/api/chat"
Aug 21 09:11:20 ollama[6191]: [GIN] 2025/08/21 - 09:11:20 | 200 |  257.347022ms |       127.0.0.1 | POST     "/api/chat"
Aug 21 09:11:23 ollama[6191]: [GIN] 2025/08/21 - 09:11:23 | 200 |   1.21404796s |       127.0.0.1 | POST     "/api/chat"
Aug 21 09:11:44 ollama[6191]: [GIN] 2025/08/21 - 09:11:44 | 200 |  17.74461018s |       127.0.0.1 | POST     "/api/chat"
Aug 21 09:14:12 ollama[6191]: [GIN] 2025/08/21 - 09:14:12 | 200 |   11.433904ms |  192.168.127.20 | GET      "/api/tags"
Aug 21 09:14:12 ollama[6191]: [GIN] 2025/08/21 - 09:14:12 | 200 |        42.5µs |  192.168.127.20 | GET      "/api/ps"
Aug 21 09:14:12 ollama[6191]: [GIN] 2025/08/21 - 09:14:12 | 200 |      49.152µs |  192.168.127.20 | GET      "/api/version"
Aug 21 09:14:15 ollama[6191]: [GIN] 2025/08/21 - 09:14:15 | 200 |      34.234µs |  192.168.127.20 | GET      "/api/version"
Aug 21 09:14:15 ollama[6191]: [GIN] 2025/08/21 - 09:14:15 | 200 |     982.593µs |  192.168.127.20 | GET      "/api/tags"
Aug 21 09:14:15 ollama[6191]: [GIN] 2025/08/21 - 09:14:15 | 200 |      12.784µs |  192.168.127.20 | GET      "/api/ps"
Aug 21 09:14:36 ollama[6191]: time=2025-08-21T09:14:36.684+03:00 level=INFO source=server.go:211 msg="enabling flash attention"
Aug 21 09:14:36 ollama[6191]: time=2025-08-21T09:14:36.684+03:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
Aug 21 09:14:36 ollama[6191]: time=2025-08-21T09:14:36.684+03:00 level=INFO source=server.go:383 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 34605"
Aug 21 09:14:36 ollama[6191]: time=2025-08-21T09:14:36.691+03:00 level=INFO source=runner.go:1006 msg="starting ollama engine"
Aug 21 09:14:36 ollama[6191]: time=2025-08-21T09:14:36.692+03:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:34605"
Aug 21 09:14:37 ollama[6191]: time=2025-08-21T09:14:37.003+03:00 level=INFO source=server.go:488 msg="system memory" total="184.1 GiB" free="174.9 GiB" free_swap="8.0 GiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.293+03:00 level=INFO source=server.go:531 msg=offload library=cuda layers.requested=38 layers.model=37 layers.offload=17 layers.split="[4 4 9]" memory.available="[23.0 GiB 23.1 GiB 31.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="109.9 GiB" memory.required.partial="76.5 GiB" memory.required.kv="2.7 GiB" memory.required.allocations="[22.6 GiB 22.6 GiB 31.2 GiB]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="13.7 GiB" memory.graph.partial="13.7 GiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.294+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:9(0..8) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:9(9..17) ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:19(18..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.334+03:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
Aug 21 09:14:38 ollama[6191]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Aug 21 09:14:38 ollama[6191]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Aug 21 09:14:38 ollama[6191]: ggml_cuda_init: found 3 CUDA devices:
Aug 21 09:14:38 ollama[6191]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Aug 21 09:14:38 ollama[6191]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Aug 21 09:14:38 ollama[6191]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Aug 21 09:14:38 ollama[6191]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
Aug 21 09:14:38 ollama[6191]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.521+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.932+03:00 level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.932+03:00 level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.932+03:00 level=INFO source=ggml.go:497 msg="offloaded 37/37 layers to GPU"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="14.7 GiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="14.7 GiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="30.4 GiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="631.0 MiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="768.5 MiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="1.4 GiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="187.0 MiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="179.5 MiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="179.5 MiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:342 msg="total memory" size="64.1 GiB"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=server.go:1234 msg="waiting for llama runner to start responding"
Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server loading model"
Aug 21 09:14:58 ollama[6191]: time=2025-08-21T09:14:58.231+03:00 level=INFO source=server.go:1272 msg="llama runner started in 21.55 seconds"
Aug 21 09:15:06 ollama[6191]: [GIN] 2025/08/21 - 09:15:06 | 200 | 33.234048006s |  192.168.127.20 | POST     "/api/chat"
Aug 21 09:15:11 ollama[6191]: [GIN] 2025/08/21 - 09:15:11 | 200 |   5.62881336s |  192.168.127.20 | POST     "/api/chat"

It seems to have nothing to do with OLLAMA_NEW_ESTIMATES (at least in the ollama console chat). I apologize.

ollama run gpt-oss:120b
>>> Hi, introduce yourself


>>> Hi, introduce yourself


>>> Hi, introduce yourself


>>> Hi
Below is a **step‑by‑step explanation** of why each element in the process is needed and what purpose it serves.
The format is *“Step N – What is done → Why it is necessary”* so you can see the reasoning behind every action.

---

### Step 1 – Identify the Goal
**What is done:** Clearly define what you want to achieve (e.g., solve a problem, build a feature, answer a question).
**Why it is necessary:** Without a concrete goal you can’t decide which actions are relevant, you may waste effort on unrelated work, and you won’t have a way to measure success.

### Step 2 – Gather Requirements / Constraints
**What is done:** Collect all functional and non‑functional requirements, inputs, outputs, performance limits, security policies, etc.
**Why it is necessary:** Requirements tell you **what** the solution must do and **how** it must behave. Ignoring them leads to a solution that either fails to meet user needs or breaks system rules.

### Step 3 – Perform a High‑Level Analysis
**What is done:** Break the problem into major components or phases (e.g., input processing, core logic, output generation).
**Why it is necessary:** A high‑level view helps you see the overall structure, spot dependencies early, and avoid “tunnel vision” on a single part while forgetting the rest.

### Step 4 – Choose an Architecture / Design Pattern
**What is done:** Select a suitable architectural style (layered, micro‑services, event‑driven, etc.) or design pattern (Factory, Observer, MVC, etc.).
**Why it is necessary:** Good architecture ensures scalability, maintainability, and testability. A pattern provides a proven solution to a recurring problem, reducing the chance of hidden bugs.

### Step 5 – Create Detailed Design (Algorithms & Data Structures)
**What is done:** Specify exact algorithms, data structures, class diagrams, API contracts, and interaction flows.
**Why it is necessary:** Details turn abstract ideas into implementable code. Choosing the right algorithm/data structure can drastically affect performance and memory usage.

### Step 6 – Validate the Design (Peer Review / Walkthrough)
**What is done:** Have teammates review the design, run design‑by‑example scenarios, or create quick prototypes.
**Why it is necessary:** Early feedback catches logical errors, ambiguous specifications, and unrealistic assumptions before any code is written, saving time later.

### Step 7 – Set Up the Development Environment
**What is done:** Install compilers, libraries, CI pipelines, version‑control hooks, and configure build scripts.
**Why it is necessary:** A consistent environment prevents “works on my machine” problems, enables automated testing, and speeds up iteration.

### Step 8 – Implement the Solution Incrementally
**What is done:** Write code in small, testable units (functions, classes, modules) and commit frequently.
**Why it is necessary:** Small increments make it easier to locate defects, allow continuous integration, and let you get early feedback from tests or users.

### Step 9 – Write Automated Tests (Unit / Integration / Acceptance)
**What is done:** Create tests that verify each component works in isolation and together as a system.
**Why it is necessary:** Tests provide a safety net against regressions, document expected behavior, and give confidence when refactoring or extending the code.

### Step 10 – Run Static Analysis / Code Review
**What is done:** Use linters, type checkers, security scanners, and have peers review the code.
**Why it is necessary:** Static checks catch style violations, potential bugs, and security issues that tests might miss, improving overall code quality.

### Step 11 – Perform Performance & Load Testing
**What is done:** Measure latency, throughput, memory consumption under realistic workloads.
**Why it is necessary:** Even a functionally correct system can fail in production if it can’t handle the expected load or meets performance SLAs.

### Step 12 – Deploy to a Staging Environment
**What is done:** Push the build to an environment that mirrors production (same configuration, databases, network).
**Why it is necessary:** Staging validates that deployment scripts, environment variables, and external integrations work correctly before affecting real users.

### Step 13 – Conduct Acceptance Testing / User Validation
**What is done:** Let stakeholders or a beta‑user group interact with the system and confirm it meets their needs.
**Why it is necessary:** Stakeholder sign‑off ensures the product solves the real problem; hidden usability issues are often discovered only at this stage.

### Step 14 – Deploy to Production
**What is done:** Release the vetted code to the live environment, often using a rollout strategy (blue‑green, canary, rolling).
**Why it is necessary:** A controlled rollout minimizes risk; if something goes wrong you can quickly revert or isolate the issue.

### Step 15 – Monitor & Observe in Production
**What is done:** Collect logs, metrics, traces, and alerts to track health and performance.
**Why it is necessary:** Monitoring detects anomalies early, enabling rapid response before users are impacted; it also provides data for future improvements.

### Step 16 – Gather Feedback & Iterate
**What is done:** Analyze user feedback, error reports, and performance data; plan enhancements or bug fixes.
**Why it is necessary:** Software is never truly finished; continuous improvement keeps the solution relevant, secure, and efficient over time.

---

#### Summary

| # | Step | Core Reason |
|---|------|--------------|
| 1 | Define Goal | Gives direction & success criteria |
| 2 | Gather Requirements | Ensures solution meets real needs |
| 3 | High‑Level Analysis | Reveals overall structure & dependencies |
| 4 | Choose Architecture | Guarantees scalability & maintainability |
| 5 | Detailed Design | Turns concepts into concrete, efficient code |
| 6 | Validate Design | Catches mistakes before coding |
| 7 | Set Up Env. | Prevents environment‑related bugs |
| 8 | Incremental Implementation | Makes debugging & integration easier |
| 9 | Automated Tests | Protects against regressions |
|10| Static Analysis / Review | Improves code quality & security |
|11| Performance Testing | Verifies the system can handle load |
|12| Staging Deployment | Confirms deployment process works |
|13| Acceptance Testing | Confirms it solves the real problem |
|14| Production Deploy | Delivers value to end users |
|15| Monitoring | Detects issues early in live use |
|16| Feedback Loop | Drives continuous improvement |

Following these steps, each **necessary** because it either prevents future problems, validates correctness, or ensures the product delivers real value, leads to a robust, maintainable, and user‑focused solution.

>>>

But it works in Open-WebUI v0.6.22 (latest). There is still some formatting mess, especially with the thinking block, but that could be an Open WebUI problem rather than Ollama's output.

(screenshot of the Open-WebUI session attached)
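To take both clients out of the picture, the raw server response can be checked against the same endpoint the GIN logs show; a minimal sketch (the model name and prompt are the ones from this thread, the rest is standard Ollama API usage):

```shell
# Hedged sketch: query /api/chat directly, bypassing both the CLI and Open-WebUI.
curl -s http://localhost:11434/api/chat -d '{
  "model": "gpt-oss:120b",
  "messages": [{"role": "user", "content": "Hi, introduce yourself"}],
  "stream": false
}'
```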
[GIN] 2025/08/21 - 09:14:15 | 200 | 982.593µs | 192.168.127.20 | GET "/api/tags" Aug 21 09:14:15 ollama[6191]: [GIN] 2025/08/21 - 09:14:15 | 200 | 12.784µs | 192.168.127.20 | GET "/api/ps" Aug 21 09:14:36 ollama[6191]: time=2025-08-21T09:14:36.684+03:00 level=INFO source=server.go:211 msg="enabling flash attention" Aug 21 09:14:36 ollama[6191]: time=2025-08-21T09:14:36.684+03:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type="" Aug 21 09:14:36 ollama[6191]: time=2025-08-21T09:14:36.684+03:00 level=INFO source=server.go:383 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 34605" Aug 21 09:14:36 ollama[6191]: time=2025-08-21T09:14:36.691+03:00 level=INFO source=runner.go:1006 msg="starting ollama engine" Aug 21 09:14:36 ollama[6191]: time=2025-08-21T09:14:36.692+03:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:34605" Aug 21 09:14:37 ollama[6191]: time=2025-08-21T09:14:37.003+03:00 level=INFO source=server.go:488 msg="system memory" total="184.1 GiB" free="174.9 GiB" free_swap="8.0 GiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.293+03:00 level=INFO source=server.go:531 msg=offload library=cuda layers.requested=38 layers.model=37 layers.offload=17 layers.split="[4 4 9]" memory.available="[23.0 GiB 23.1 GiB 31.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="109.9 GiB" memory.required.partial="76.5 GiB" memory.required.kv="2.7 GiB" memory.required.allocations="[22.6 GiB 22.6 GiB 31.2 GiB]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="13.7 GiB" memory.graph.partial="13.7 GiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.294+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:9(0..8) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:9(9..17) ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:19(18..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.334+03:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30 Aug 21 09:14:38 ollama[6191]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no Aug 21 09:14:38 ollama[6191]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no Aug 21 09:14:38 ollama[6191]: ggml_cuda_init: found 3 CUDA devices: Aug 21 09:14:38 ollama[6191]: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Aug 21 09:14:38 ollama[6191]: Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Aug 21 09:14:38 ollama[6191]: Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Aug 21 09:14:38 ollama[6191]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so Aug 21 09:14:38 ollama[6191]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.521+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 
CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.932+03:00 level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.932+03:00 level=INFO source=ggml.go:492 msg="offloading output layer to GPU" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.932+03:00 level=INFO source=ggml.go:497 msg="offloaded 37/37 layers to GPU" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="14.7 GiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="14.7 GiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="30.4 GiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="631.0 MiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="768.5 MiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="1.4 GiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="187.0 MiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="179.5 MiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="179.5 MiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=backend.go:342 msg="total memory" size="64.1 GiB" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1 Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=server.go:1234 msg="waiting for llama runner to start responding" Aug 21 09:14:38 ollama[6191]: time=2025-08-21T09:14:38.933+03:00 level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server loading model" Aug 21 09:14:58 ollama[6191]: time=2025-08-21T09:14:58.231+03:00 level=INFO source=server.go:1272 msg="llama runner started in 21.55 seconds" Aug 21 09:15:06 ollama[6191]: [GIN] 2025/08/21 - 09:15:06 | 200 | 33.234048006s | 192.168.127.20 | POST "/api/chat" Aug 21 09:15:11 ollama[6191]: [GIN] 2025/08/21 - 09:15:11 | 200 | 5.62881336s | 192.168.127.20 | POST "/api/chat" ``` It seems to have nothing to do with OLLAMA_NEW_ESTIMATES (at least in ollama console chat). 
I apologize.

```
ollama run gpt-oss:120b
>>> Hi, introduce yourself
>>> Hi, introduce yourself
>>> Hi, introduce yourself
>>> Hi
Below is a **step‑by‑step explanation** of why each element in the process is needed and what purpose it serves. The format is *“Step N – What is done → Why it is necessary”* so you can see the reasoning behind every action.

---

### Step 1 – Identify the Goal
**What is done:** Clearly define what you want to achieve (e.g., solve a problem, build a feature, answer a question).
**Why it is necessary:** Without a concrete goal you can’t decide which actions are relevant, you may waste effort on unrelated work, and you won’t have a way to measure success.

### Step 2 – Gather Requirements / Constraints
**What is done:** Collect all functional and non‑functional requirements, inputs, outputs, performance limits, security policies, etc.
**Why it is necessary:** Requirements tell you **what** the solution must do and **how** it must behave. Ignoring them leads to a solution that either fails to meet user needs or breaks system rules.

### Step 3 – Perform a High‑Level Analysis
**What is done:** Break the problem into major components or phases (e.g., input processing, core logic, output generation).
**Why it is necessary:** A high‑level view helps you see the overall structure, spot dependencies early, and avoid “tunnel vision” on a single part while forgetting the rest.

### Step 4 – Choose an Architecture / Design Pattern
**What is done:** Select a suitable architectural style (layered, micro‑services, event‑driven, etc.) or design pattern (Factory, Observer, MVC, etc.).
**Why it is necessary:** Good architecture ensures scalability, maintainability, and testability. A pattern provides a proven solution to a recurring problem, reducing the chance of hidden bugs.

### Step 5 – Create Detailed Design (Algorithms & Data Structures)
**What is done:** Specify exact algorithms, data structures, class diagrams, API contracts, and interaction flows.
**Why it is necessary:** Details turn abstract ideas into implementable code. Choosing the right algorithm/data structure can drastically affect performance and memory usage.

### Step 6 – Validate the Design (Peer Review / Walkthrough)
**What is done:** Have teammates review the design, run design‑by‑example scenarios, or create quick prototypes.
**Why it is necessary:** Early feedback catches logical errors, ambiguous specifications, and unrealistic assumptions before any code is written, saving time later.

### Step 7 – Set Up the Development Environment
**What is done:** Install compilers, libraries, CI pipelines, version‑control hooks, and configure build scripts.
**Why it is necessary:** A consistent environment prevents “works on my machine” problems, enables automated testing, and speeds up iteration.

### Step 8 – Implement the Solution Incrementally
**What is done:** Write code in small, testable units (functions, classes, modules) and commit frequently.
**Why it is necessary:** Small increments make it easier to locate defects, allow continuous integration, and let you get early feedback from tests or users.

### Step 9 – Write Automated Tests (Unit / Integration / Acceptance)
**What is done:** Create tests that verify each component works in isolation and together as a system.
**Why it is necessary:** Tests provide a safety net against regressions, document expected behavior, and give confidence when refactoring or extending the code.

### Step 10 – Run Static Analysis / Code Review
**What is done:** Use linters, type checkers, security scanners, and have peers review the code.
**Why it is necessary:** Static checks catch style violations, potential bugs, and security issues that tests might miss, improving overall code quality.

### Step 11 – Perform Performance & Load Testing
**What is done:** Measure latency, throughput, memory consumption under realistic workloads.
**Why it is necessary:** Even a functionally correct system can fail in production if it can’t handle the expected load or meets performance SLAs.

### Step 12 – Deploy to a Staging Environment
**What is done:** Push the build to an environment that mirrors production (same configuration, databases, network).
**Why it is necessary:** Staging validates that deployment scripts, environment variables, and external integrations work correctly before affecting real users.

### Step 13 – Conduct Acceptance Testing / User Validation
**What is done:** Let stakeholders or a beta‑user group interact with the system and confirm it meets their needs.
**Why it is necessary:** Stakeholder sign‑off ensures the product solves the real problem; hidden usability issues are often discovered only at this stage.

### Step 14 – Deploy to Production
**What is done:** Release the vetted code to the live environment, often using a rollout strategy (blue‑green, canary, rolling).
**Why it is necessary:** A controlled rollout minimizes risk; if something goes wrong you can quickly revert or isolate the issue.

### Step 15 – Monitor & Observe in Production
**What is done:** Collect logs, metrics, traces, and alerts to track health and performance.
**Why it is necessary:** Monitoring detects anomalies early, enabling rapid response before users are impacted; it also provides data for future improvements.

### Step 16 – Gather Feedback & Iterate
**What is done:** Analyze user feedback, error reports, and performance data; plan enhancements or bug fixes.
**Why it is necessary:** Software is never truly finished; continuous improvement keeps the solution relevant, secure, and efficient over time.

---

#### Summary

| # | Step | Core Reason |
|---|------|--------------|
| 1 | Define Goal | Gives direction & success criteria |
| 2 | Gather Requirements | Ensures solution meets real needs |
| 3 | High‑Level Analysis | Reveals overall structure & dependencies |
| 4 | Choose Architecture | Guarantees scalability & maintainability |
| 5 | Detailed Design | Turns concepts into concrete, efficient code |
| 6 | Validate Design | Catches mistakes before coding |
| 7 | Set Up Env. | Prevents environment‑related bugs |
| 8 | Incremental Implementation | Makes debugging & integration easier |
| 9 | Automated Tests | Protects against regressions |
|10| Static Analysis / Review | Improves code quality & security |
|11| Performance Testing | Verifies the system can handle load |
|12| Staging Deployment | Confirms deployment process works |
|13| Acceptance Testing | Confirms it solves the real problem |
|14| Production Deploy | Delivers value to end users |
|15| Monitoring | Detects issues early in live use |
|16| Feedback Loop | Drives continuous improvement |

Following these steps, each **necessary** because it either prevents future problems, validates correctness, or ensures the product delivers real value, leads to a robust, maintainable, and user‑focused solution.

>>>
```

But it works in Open-WebUI v0.6.22 (latest).
Although there is still some formatting mess, especially with the thinking block, that could be an Open WebUI problem rather than Ollama's output.

<img width="1027" height="526" alt="Image" src="https://github.com/user-attachments/assets/ce388746-49a1-4ee0-9e21-8d2405bb8e1a" />

<!-- gh-comment-id:3211512447 --> @jessegross commented on GitHub (Aug 21, 2025):

Thanks for the update that it is not related to OLLAMA_NEW_ESTIMATES. Possibly Open WebUI is calling Ollama with different settings that avoid the problem. Can you post the log with Open WebUI when it works fine?

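One concrete "different setting" is already visible in the logs: the console-chat load on Aug 21 ran with KvSize:8192, while the Open-WebUI loads ran with KvSize:75000, i.e. the web client requested a much larger context window. A minimal sketch for isolating that variable (hypothetical values; assumes only the standard /api/chat options field):

```sh
# Replay the same prompt with an explicit per-request context size, taking
# client-side defaults out of the picture. If the broken output tracks
# num_ctx rather than the env flag, the layer-split / KV-sizing path is a
# better suspect than OLLAMA_NEW_ESTIMATES itself.
curl -s http://localhost:11434/api/chat -d '{
  "model": "gpt-oss:120b",
  "messages": [{"role": "user", "content": "Hi, introduce yourself"}],
  "options": {"num_ctx": 75000},
  "stream": false
}'
```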

<!-- gh-comment-id:3213347666 --> @ka-admin commented on GitHub (Aug 22, 2025):

sure
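
(For reference, the flag was flipped between the two runs below; on a systemd install the toggle is roughly the following — a sketch that assumes the stock ollama.service unit and a drop-in override:)

```sh
# Sketch: create or edit a systemd drop-in for the ollama service.
sudo systemctl edit ollama
# In the editor, set the flag for this run:
#   [Service]
#   Environment="OLLAMA_NEW_ESTIMATES=0"   # or "1" for the second run
sudo systemctl restart ollama
# Then follow the logs exactly as below:
journalctl -u ollama --no-pager --follow --pager-end
```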

**With OLLAMA_NEW_ESTIMATES = 0**


journalctl -u ollama --no-pager --follow --pager-end
Aug 22 09:39:04 systemd[1]: Started ollama.service - Ollama Service.
Aug 22 09:39:04 ollama[1983]: time=2025-08-22T09:39:04.089+03:00 level=INFO source=routes.go:1318 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NEW_ESTIMATES:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Aug 22 09:39:04 ollama[1983]: time=2025-08-22T09:39:04.107+03:00 level=INFO source=images.go:477 msg="total blobs: 68"
Aug 22 09:39:04 ollama[1983]: time=2025-08-22T09:39:04.107+03:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
Aug 22 09:39:04 ollama[1983]: time=2025-08-22T09:39:04.108+03:00 level=INFO source=routes.go:1371 msg="Listening on [::]:11434 (version 0.11.6)"
Aug 22 09:39:04 ollama[1983]: time=2025-08-22T09:39:04.109+03:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Aug 22 09:39:06 ollama[1983]: time=2025-08-22T09:39:06.102+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Aug 22 09:39:06 ollama[1983]: time=2025-08-22T09:39:06.102+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Aug 22 09:39:06 ollama[1983]: time=2025-08-22T09:39:06.102+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda variant=v12 compute=7.0 driver=13.0 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
Aug 22 09:52:29 ollama[1983]: [GIN] 2025/08/22 - 09:52:29 | 200 |   12.525484ms |  192.168.127.20 | GET      "/api/tags"
Aug 22 09:52:29 ollama[1983]: [GIN] 2025/08/22 - 09:52:29 | 200 |      44.403µs |  192.168.127.20 | GET      "/api/ps"
Aug 22 09:52:29 ollama[1983]: [GIN] 2025/08/22 - 09:52:29 | 200 |      50.274µs |  192.168.127.20 | GET      "/api/version"
Aug 22 10:04:53 ollama[1983]: [GIN] 2025/08/22 - 10:04:53 | 200 |    1.057968ms |  192.168.127.20 | GET      "/api/tags"
Aug 22 10:04:53 ollama[1983]: [GIN] 2025/08/22 - 10:04:53 | 200 |       8.275µs |  192.168.127.20 | GET      "/api/ps"
Aug 22 10:04:53 ollama[1983]: [GIN] 2025/08/22 - 10:04:53 | 200 |      32.922µs |  192.168.127.20 | GET      "/api/version"
Aug 22 10:05:00 ollama[1983]: [GIN] 2025/08/22 - 10:05:00 | 200 |    1.062437ms |  192.168.127.20 | GET      "/api/tags"
Aug 22 10:05:00 ollama[1983]: [GIN] 2025/08/22 - 10:05:00 | 200 |       9.428µs |  192.168.127.20 | GET      "/api/ps"
Aug 22 10:05:02 ollama[1983]: [GIN] 2025/08/22 - 10:05:02 | 200 |    1.186741ms |  192.168.127.20 | GET      "/api/tags"
Aug 22 10:05:02 ollama[1983]: [GIN] 2025/08/22 - 10:05:02 | 200 |      13.305µs |  192.168.127.20 | GET      "/api/ps"
Aug 22 10:05:08 ollama[1983]: [GIN] 2025/08/22 - 10:05:08 | 200 |     1.47887ms |  192.168.127.20 | GET      "/api/tags"
Aug 22 10:05:08 ollama[1983]: [GIN] 2025/08/22 - 10:05:08 | 200 |       8.576µs |  192.168.127.20 | GET      "/api/ps"
Aug 22 10:05:10 ollama[1983]: [GIN] 2025/08/22 - 10:05:10 | 200 |      33.072µs |  192.168.127.20 | GET      "/api/version"
Aug 22 10:06:48 ollama[1983]: [GIN] 2025/08/22 - 10:06:48 | 200 |      36.148µs |  192.168.127.20 | GET      "/api/version"
Aug 22 10:07:05 ollama[1983]: time=2025-08-22T10:07:05.703+03:00 level=INFO source=server.go:211 msg="enabling flash attention"
Aug 22 10:07:05 ollama[1983]: time=2025-08-22T10:07:05.710+03:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
Aug 22 10:07:05 ollama[1983]: time=2025-08-22T10:07:05.710+03:00 level=INFO source=server.go:383 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 43039"
Aug 22 10:07:05 ollama[1983]: time=2025-08-22T10:07:05.719+03:00 level=INFO source=runner.go:1006 msg="starting ollama engine"
Aug 22 10:07:05 ollama[1983]: time=2025-08-22T10:07:05.719+03:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:43039"
Aug 22 10:07:06 ollama[1983]: time=2025-08-22T10:07:06.039+03:00 level=INFO source=server.go:488 msg="system memory" total="184.1 GiB" free="175.5 GiB" free_swap="8.0 GiB"
Aug 22 10:07:07 ollama[1983]: time=2025-08-22T10:07:07.378+03:00 level=INFO source=server.go:531 msg=offload library=cuda layers.requested=38 layers.model=37 layers.offload=17 layers.split="[4 4 9]" memory.available="[23.1 GiB 23.1 GiB 31.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="109.9 GiB" memory.required.partial="76.5 GiB" memory.required.kv="2.7 GiB" memory.required.allocations="[22.6 GiB 22.6 GiB 31.2 GiB]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="13.7 GiB" memory.graph.partial="13.7 GiB"
Aug 22 10:07:07 ollama[1983]: time=2025-08-22T10:07:07.387+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:9(0..8) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:9(9..17) ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:19(18..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 22 10:07:07 ollama[1983]: time=2025-08-22T10:07:07.425+03:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
Aug 22 10:07:07 ollama[1983]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Aug 22 10:07:07 ollama[1983]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Aug 22 10:07:07 ollama[1983]: ggml_cuda_init: found 3 CUDA devices:
Aug 22 10:07:07 ollama[1983]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Aug 22 10:07:07 ollama[1983]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Aug 22 10:07:07 ollama[1983]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Aug 22 10:07:07 ollama[1983]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
Aug 22 10:07:07 ollama[1983]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Aug 22 10:07:07 ollama[1983]: time=2025-08-22T10:07:07.735+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=ggml.go:497 msg="offloaded 37/37 layers to GPU"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="14.7 GiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="14.7 GiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="30.4 GiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="631.0 MiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="768.5 MiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="1.4 GiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="187.0 MiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="179.5 MiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="179.5 MiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:342 msg="total memory" size="64.1 GiB"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=server.go:1234 msg="waiting for llama runner to start responding"
Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server loading model"
Aug 22 10:07:31 ollama[1983]: time=2025-08-22T10:07:31.713+03:00 level=INFO source=server.go:1272 msg="llama runner started in 26.00 seconds"
Aug 22 10:07:36 ollama[1983]: [GIN] 2025/08/22 - 10:07:36 | 200 | 31.886988523s |  192.168.127.20 | POST     "/api/chat"
Aug 22 10:07:39 ollama[1983]: [GIN] 2025/08/22 - 10:07:39 | 200 |  2.854969906s |  192.168.127.20 | POST     "/api/chat"
<img width="1004" height="322" alt="Image" src="https://github.com/user-attachments/assets/4149ae3d-53f2-41ce-ad9a-a13de2de93f0" />

PS Notice the model thought block is inserted without <think> tags

**With OLLAMA_NEW_ESTIMATES = 1**

Started ollama.service - Ollama Service.
Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.283+03:00 level=INFO source=routes.go:1318 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NEW_ESTIMATES:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.285+03:00 level=INFO source=images.go:477 msg="total blobs: 68"
Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.285+03:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.286+03:00 level=INFO source=routes.go:1371 msg="Listening on [::]:11434 (version 0.11.6)"
Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.286+03:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.728+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.728+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.728+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda variant=v12 compute=7.0 driver=13.0 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
Aug 22 10:11:06 ollama[8313]: time=2025-08-22T10:11:06.883+03:00 level=INFO source=server.go:166 msg="enabling new memory estimates"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.222+03:00 level=INFO source=server.go:211 msg="enabling flash attention"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.222+03:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.223+03:00 level=INFO source=server.go:383 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 42905"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.223+03:00 level=INFO source=server.go:659 msg="loading model" "model layers"=37 requested=38
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.230+03:00 level=INFO source=runner.go:1006 msg="starting ollama engine"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.231+03:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:42905"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.557+03:00 level=INFO source=server.go:665 msg="system memory" total="184.1 GiB" free="175.5 GiB" free_swap="8.0 GiB"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.557+03:00 level=INFO source=server.go:669 msg="gpu memory" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.557+03:00 level=INFO source=server.go:669 msg="gpu memory" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.557+03:00 level=INFO source=server.go:669 msg="gpu memory" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 available="31.0 GiB" free="31.4 GiB" minimum="457.0 MiB" overhead="0 B"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.557+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.591+03:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
Aug 22 10:11:07 ollama[8313]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Aug 22 10:11:07 ollama[8313]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Aug 22 10:11:07 ollama[8313]: ggml_cuda_init: found 3 CUDA devices:
Aug 22 10:11:07 ollama[8313]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Aug 22 10:11:07 ollama[8313]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Aug 22 10:11:07 ollama[8313]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Aug 22 10:11:07 ollama[8313]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
Aug 22 10:11:07 ollama[8313]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.768+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.874+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.910+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=ggml.go:497 msg="offloaded 37/37 layers to GPU"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="17.4 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="17.9 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="24.5 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="777.5 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="924.0 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="1.1 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="179.5 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="179.5 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="187.0 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:342 msg="total memory" size="64.1 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=server.go:1234 msg="waiting for llama runner to start responding"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server loading model"
Aug 22 10:11:24 ollama[8313]: time=2025-08-22T10:11:24.551+03:00 level=INFO source=server.go:1272 msg="llama runner started in 17.33 seconds"
Aug 22 10:11:36 ollama[8313]: [GIN] 2025/08/22 - 10:11:36 | 200 | 30.393176227s |  192.168.127.20 | POST     "/api/chat"
Aug 22 10:12:31 ollama[8313]: [GIN] 2025/08/22 - 10:12:31 | 200 | 55.275684061s |  192.168.127.20 | POST     "/api/chat"
<img width="990" height="540" alt="Image" src="https://github.com/user-attachments/assets/e1b233f5-666e-41cf-83a9-1770c9ad63dc" />
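
Regarding the PS above about the thought block appearing inline without <think> tags: a hedged sketch of how a client can ask for the reasoning in a separate field instead (assumes a server new enough to support the "think" request field; for gpt-oss some versions expect a level such as "low"/"medium"/"high" rather than a boolean):

```sh
# When the "think" field is honored, reasoning is returned in
# message.thinking instead of being inlined into message.content,
# so clients like Open WebUI can render it as a proper thinking block.
curl -s http://localhost:11434/api/chat -d '{
  "model": "gpt-oss:120b",
  "messages": [{"role": "user", "content": "Hi, introduce yourself"}],
  "think": true,
  "stream": false
}'
```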
<!-- gh-comment-id:3213347666 --> @ka-admin commented on GitHub (Aug 22, 2025): sure **With OLLAMA_NEW_ESTIMATES = 0** ``` journalctl -u ollama --no-pager --follow --pager-end Aug 22 09:39:04 systemd[1]: Started ollama.service - Ollama Service. Aug 22 09:39:04 ollama[1983]: time=2025-08-22T09:39:04.089+03:00 level=INFO source=routes.go:1318 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NEW_ESTIMATES:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]" Aug 22 09:39:04 ollama[1983]: time=2025-08-22T09:39:04.107+03:00 level=INFO source=images.go:477 msg="total blobs: 68" Aug 22 09:39:04 ollama[1983]: time=2025-08-22T09:39:04.107+03:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0" Aug 22 09:39:04 ollama[1983]: time=2025-08-22T09:39:04.108+03:00 level=INFO source=routes.go:1371 msg="Listening on [::]:11434 (version 0.11.6)" Aug 22 09:39:04 ollama[1983]: time=2025-08-22T09:39:04.109+03:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs" Aug 22 09:39:06 ollama[1983]: time=2025-08-22T09:39:06.102+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB" Aug 22 09:39:06 ollama[1983]: time=2025-08-22T09:39:06.102+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB" Aug 22 09:39:06 ollama[1983]: time=2025-08-22T09:39:06.102+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda variant=v12 compute=7.0 driver=13.0 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB" Aug 22 09:52:29 ollama[1983]: [GIN] 2025/08/22 - 09:52:29 | 200 | 12.525484ms | 192.168.127.20 | GET "/api/tags" Aug 22 09:52:29 ollama[1983]: [GIN] 2025/08/22 - 09:52:29 | 200 | 44.403µs | 192.168.127.20 | GET "/api/ps" Aug 22 09:52:29 ollama[1983]: [GIN] 2025/08/22 - 09:52:29 | 200 | 50.274µs | 192.168.127.20 | GET "/api/version" Aug 22 10:04:53 ollama[1983]: [GIN] 2025/08/22 - 10:04:53 | 200 | 1.057968ms | 192.168.127.20 | GET "/api/tags" Aug 22 10:04:53 ollama[1983]: [GIN] 2025/08/22 - 10:04:53 | 200 | 8.275µs | 192.168.127.20 | GET "/api/ps" Aug 22 10:04:53 ollama[1983]: [GIN] 2025/08/22 - 10:04:53 | 200 | 32.922µs | 192.168.127.20 | GET "/api/version" Aug 22 10:05:00 ollama[1983]: [GIN] 2025/08/22 - 10:05:00 | 200 | 1.062437ms | 192.168.127.20 | GET "/api/tags" Aug 22 10:05:00 ollama[1983]: [GIN] 2025/08/22 - 10:05:00 | 200 | 
9.428µs | 192.168.127.20 | GET "/api/ps" Aug 22 10:05:02 ollama[1983]: [GIN] 2025/08/22 - 10:05:02 | 200 | 1.186741ms | 192.168.127.20 | GET "/api/tags" Aug 22 10:05:02 ollama[1983]: [GIN] 2025/08/22 - 10:05:02 | 200 | 13.305µs | 192.168.127.20 | GET "/api/ps" Aug 22 10:05:08 ollama[1983]: [GIN] 2025/08/22 - 10:05:08 | 200 | 1.47887ms | 192.168.127.20 | GET "/api/tags" Aug 22 10:05:08 ollama[1983]: [GIN] 2025/08/22 - 10:05:08 | 200 | 8.576µs | 192.168.127.20 | GET "/api/ps" Aug 22 10:05:10 ollama[1983]: [GIN] 2025/08/22 - 10:05:10 | 200 | 33.072µs | 192.168.127.20 | GET "/api/version" Aug 22 10:06:48 ollama[1983]: [GIN] 2025/08/22 - 10:06:48 | 200 | 36.148µs | 192.168.127.20 | GET "/api/version" Aug 22 10:07:05 ollama[1983]: time=2025-08-22T10:07:05.703+03:00 level=INFO source=server.go:211 msg="enabling flash attention" Aug 22 10:07:05 ollama[1983]: time=2025-08-22T10:07:05.710+03:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type="" Aug 22 10:07:05 ollama[1983]: time=2025-08-22T10:07:05.710+03:00 level=INFO source=server.go:383 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 43039" Aug 22 10:07:05 ollama[1983]: time=2025-08-22T10:07:05.719+03:00 level=INFO source=runner.go:1006 msg="starting ollama engine" Aug 22 10:07:05 ollama[1983]: time=2025-08-22T10:07:05.719+03:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:43039" Aug 22 10:07:06 ollama[1983]: time=2025-08-22T10:07:06.039+03:00 level=INFO source=server.go:488 msg="system memory" total="184.1 GiB" free="175.5 GiB" free_swap="8.0 GiB" Aug 22 10:07:07 ollama[1983]: time=2025-08-22T10:07:07.378+03:00 level=INFO source=server.go:531 msg=offload library=cuda layers.requested=38 layers.model=37 layers.offload=17 layers.split="[4 4 9]" memory.available="[23.1 GiB 23.1 GiB 31.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="109.9 GiB" memory.required.partial="76.5 GiB" memory.required.kv="2.7 GiB" memory.required.allocations="[22.6 GiB 22.6 GiB 31.2 GiB]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="13.7 GiB" memory.graph.partial="13.7 GiB" Aug 22 10:07:07 ollama[1983]: time=2025-08-22T10:07:07.387+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:9(0..8) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:9(9..17) ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:19(18..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" Aug 22 10:07:07 ollama[1983]: time=2025-08-22T10:07:07.425+03:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30 Aug 22 10:07:07 ollama[1983]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no Aug 22 10:07:07 ollama[1983]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no Aug 22 10:07:07 ollama[1983]: ggml_cuda_init: found 3 CUDA devices: Aug 22 10:07:07 ollama[1983]: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Aug 22 10:07:07 ollama[1983]: Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Aug 22 10:07:07 ollama[1983]: Device 2: Tesla V100-SXM2-32GB, compute 
capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Aug 22 10:07:07 ollama[1983]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so Aug 22 10:07:07 ollama[1983]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so Aug 22 10:07:07 ollama[1983]: time=2025-08-22T10:07:07.735+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=ggml.go:492 msg="offloading output layer to GPU" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=ggml.go:497 msg="offloaded 37/37 layers to GPU" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="14.7 GiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="14.7 GiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="30.4 GiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="631.0 MiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="768.5 MiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="1.4 GiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="187.0 MiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="179.5 MiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="179.5 MiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=backend.go:342 msg="total memory" size="64.1 GiB" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1 Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=server.go:1234 msg="waiting for llama runner to start responding" Aug 22 10:07:08 ollama[1983]: time=2025-08-22T10:07:08.159+03:00 level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server loading model" Aug 22 10:07:31 ollama[1983]: 
time=2025-08-22T10:07:31.713+03:00 level=INFO source=server.go:1272 msg="llama runner started in 26.00 seconds" Aug 22 10:07:36 ollama[1983]: [GIN] 2025/08/22 - 10:07:36 | 200 | 31.886988523s | 192.168.127.20 | POST "/api/chat" Aug 22 10:07:39 ollama[1983]: [GIN] 2025/08/22 - 10:07:39 | 200 | 2.854969906s | 192.168.127.20 | POST "/api/chat" ``` <img width="1004" height="322" alt="Image" src="https://github.com/user-attachments/assets/4149ae3d-53f2-41ce-ad9a-a13de2de93f0" /> PS Notice model thought block inserted without <think> tags **With OLLAMA_NEW_ESTIMATES = 1** ``` Started ollama.service - Ollama Service. Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.283+03:00 level=INFO source=routes.go:1318 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NEW_ESTIMATES:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]" Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.285+03:00 level=INFO source=images.go:477 msg="total blobs: 68" Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.285+03:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0" Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.286+03:00 level=INFO source=routes.go:1371 msg="Listening on [::]:11434 (version 0.11.6)" Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.286+03:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs" Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.728+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB" Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.728+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB" Aug 22 10:10:38 ollama[8313]: time=2025-08-22T10:10:38.728+03:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda variant=v12 compute=7.0 driver=13.0 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB" Aug 22 10:11:06 ollama[8313]: time=2025-08-22T10:11:06.883+03:00 level=INFO source=server.go:166 msg="enabling new memory estimates" Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.222+03:00 level=INFO source=server.go:211 msg="enabling flash attention" Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.222+03:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type="" Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.223+03:00 
level=INFO source=server.go:383 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 42905"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.223+03:00 level=INFO source=server.go:659 msg="loading model" "model layers"=37 requested=38
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.230+03:00 level=INFO source=runner.go:1006 msg="starting ollama engine"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.231+03:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:42905"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.557+03:00 level=INFO source=server.go:665 msg="system memory" total="184.1 GiB" free="175.5 GiB" free_swap="8.0 GiB"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.557+03:00 level=INFO source=server.go:669 msg="gpu memory" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.557+03:00 level=INFO source=server.go:669 msg="gpu memory" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.557+03:00 level=INFO source=server.go:669 msg="gpu memory" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 available="31.0 GiB" free="31.4 GiB" minimum="457.0 MiB" overhead="0 B"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.557+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.591+03:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
Aug 22 10:11:07 ollama[8313]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Aug 22 10:11:07 ollama[8313]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Aug 22 10:11:07 ollama[8313]: ggml_cuda_init: found 3 CUDA devices:
Aug 22 10:11:07 ollama[8313]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Aug 22 10:11:07 ollama[8313]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Aug 22 10:11:07 ollama[8313]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Aug 22 10:11:07 ollama[8313]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
Aug 22 10:11:07 ollama[8313]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.768+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.874+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 22 10:11:07 ollama[8313]: time=2025-08-22T10:11:07.910+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:75000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=ggml.go:497 msg="offloaded 37/37 layers to GPU"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="17.4 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="17.9 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="24.5 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="777.5 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="924.0 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="1.1 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="179.5 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="179.5 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="187.0 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=backend.go:342 msg="total memory" size="64.1 GiB"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=server.go:1234 msg="waiting for llama runner to start responding"
Aug 22 10:11:08 ollama[8313]: time=2025-08-22T10:11:08.263+03:00 level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server loading model"
Aug 22 10:11:24 ollama[8313]: time=2025-08-22T10:11:24.551+03:00 level=INFO source=server.go:1272 msg="llama runner started in 17.33 seconds"
Aug 22 10:11:36 ollama[8313]: [GIN] 2025/08/22 - 10:11:36 | 200 | 30.393176227s | 192.168.127.20 | POST "/api/chat"
Aug 22 10:12:31 ollama[8313]: [GIN] 2025/08/22 - 10:12:31 | 200 | 55.275684061s | 192.168.127.20 | POST "/api/chat"
```

<img width="990" height="540" alt="Image" src="https://github.com/user-attachments/assets/e1b233f5-666e-41cf-83a9-1770c9ad63dc" />
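For anyone comparing runs: the load sequence above shows the new estimator making three passes (`Operation:fit`, `alloc`, `commit`) before settling on the 15/11/11 layer split across the V100 and the two 4090s. A quick way to A/B the two code paths is a systemd drop-in that toggles the estimator. This is only a sketch, not a confirmed fix: it assumes the stock `ollama.service` unit and that `OLLAMA_NEW_ESTIMATES` parses as a boolean (so `0`/`false` disables it).

```shell
# Sketch: toggle the new memory estimator off for a comparison run.
# Assumes the stock ollama.service unit; treating "0" as "disabled" is an
# assumption based on the boolean-style values in the server config map.
sudo systemctl edit ollama      # opens a drop-in override in $EDITOR
# In the drop-in, add:
#   [Service]
#   Environment="OLLAMA_NEW_ESTIMATES=0"
sudo systemctl restart ollama

# Confirm which estimator path the server picked up on startup:
journalctl -u ollama --no-pager | grep -E 'OLLAMA_NEW_ESTIMATES|new memory estimates' | tail -n 2
```

With the override in place, the startup config map should report `OLLAMA_NEW_ESTIMATES:false` and the "enabling new memory estimates" line should no longer appear, which makes it easy to attach logs from both configurations to this issue.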