[GH-ISSUE #11744] GPT-OSS 120b 0.11.3 - OOM #69839

Closed
opened 2026-05-04 19:32:22 -05:00 by GiteaMirror · 17 comments

Originally created by @ka-admin on GitHub (Aug 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11744

What is the issue?

I have 3 GPUs:
2x RTX 4090
1x RTX 4070

GPT-OSS 120b:
num_batch = 256
num_gpu = 25
num_ctx = 32768

0.11.3 RC runs out of memory.

The same config and the same settings run flawlessly on 0.11.2.
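For reference, a minimal request that reproduces this with the same options (a sketch, not the exact client in use; it assumes the stock `gpt-oss:120b` tag and the default port):

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "gpt-oss:120b",
  "messages": [{"role": "user", "content": "hello"}],
  "options": {
    "num_ctx": 32768,
    "num_gpu": 25,
    "num_batch": 256
  }
}'
```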

Relevant log output

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Aug 07 13:41:34  ollama[413395]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Aug 07 13:41:34  ollama[413395]: ggml_cuda_init: found 3 CUDA devices:
Aug 07 13:41:34  ollama[413395]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Aug 07 13:41:34  ollama[413395]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Aug 07 13:41:34  ollama[413395]:   Device 2: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
Aug 07 13:41:34  ollama[413395]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
Aug 07 13:41:34  ollama[413395]: time=2025-08-07T13:41:34.823+03:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
Aug 07 13:41:34  ollama[413395]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Aug 07 13:41:34  ollama[413395]: time=2025-08-07T13:41:34.824+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Aug 07 13:41:34  ollama[413395]: time=2025-08-07T13:41:34.910+03:00 level=INFO source=ggml.go:367 msg="offloading 25 repeating layers to GPU"
Aug 07 13:41:34  ollama[413395]: time=2025-08-07T13:41:34.910+03:00 level=INFO source=ggml.go:371 msg="offloading output layer to CPU"
Aug 07 13:41:34  ollama[413395]: time=2025-08-07T13:41:34.910+03:00 level=INFO source=ggml.go:378 msg="offloaded 25/37 layers to GPU"
Aug 07 13:41:34  ollama[413395]: time=2025-08-07T13:41:34.910+03:00 level=INFO source=ggml.go:381 msg="model weights" buffer=CPU size="20.1 GiB"
Aug 07 13:41:34  ollama[413395]: time=2025-08-07T13:41:34.910+03:00 level=INFO source=ggml.go:381 msg="model weights" buffer=CUDA0 size="21.2 GiB"
Aug 07 13:41:34  ollama[413395]: time=2025-08-07T13:41:34.910+03:00 level=INFO source=ggml.go:381 msg="model weights" buffer=CUDA1 size="19.5 GiB"
Aug 07 13:41:35  ollama[413395]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4157.88 MiB on device 0: cudaMalloc failed: out of memory
Aug 07 13:41:35  ollama[413395]: ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 4359848704
Aug 07 13:41:35  ollama[413395]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4146.94 MiB on device 1: cudaMalloc failed: out of memory
Aug 07 13:41:35  ollama[413395]: ggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 4348380416
Aug 07 13:41:35  ollama[413395]: time=2025-08-07T13:41:35.010+03:00 level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="4.1 GiB"
Aug 07 13:41:35  ollama[413395]: time=2025-08-07T13:41:35.010+03:00 level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="4.0 GiB"
Aug 07 13:41:35  ollama[413395]: time=2025-08-07T13:41:35.010+03:00 level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA2 buffer_type=CUDA2 size="0 B"
Aug 07 13:41:35  ollama[413395]: time=2025-08-07T13:41:35.010+03:00 level=INFO source=ggml.go:672 msg="compute graph" backend=CPU buffer_type=CPU size="4.0 GiB"
Aug 07 13:41:35  ollama[413395]: panic: insufficient memory - required allocations: {InputWeights:1158266880A CPU:{Name:CPU ID: Weights:[1748883968A 1748883968A 1748883968A 1748883968A 1748883968A 1748883968A 1748883968A 1748883968A 1748883968A 1748883968A 1748883968A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 1158278400A] Cache:[8912896A 67108864A 8912896A 67108864A 8912896A 67108864A 8912896A 67108864A 8912896A 67108864A 8912896A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:4310368256A} GPUs:[{Name:CUDA0 ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Weights:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 1748884224A 1748884224A 1748884224A 1748884224A 1748884224A 1748884224A 1748884224A 1748884224A 1748884224A 1748884224A 1748884224A 1748884224A 1748884224A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 67108864A 8912896A 67108864A 8912896A 67108864A 8912896A 67108864A 8912896A 67108864A 8912896A 67108864A 8912896A 67108864A 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:4359848704F} {Name:CUDA1 ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Weights:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 1748884224A 1748884224A 1748884224A 1748884224A 1748884224A 1748884224A 1748884224A 1748884224A 1748884224A 1748884224A 1748884224A 1748884224A 0U] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 8912896A 67108864A 8912896A 67108864A 8912896A 67108864A 8912896A 67108864A 8912896A 67108864A 8912896A 67108864A 0U] Graph:4348380416F} {Name:CUDA2 ID:GPU-515077ee-833a-270f-3392-dbfdb7c08c51 Weights:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Cache:[0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U 0U] Graph:0A}]}
Aug 07 13:41:35  ollama[413395]: goroutine 16 [running]:
Aug 07 13:41:35  ollama[413395]: github.com/ollama/ollama/ml/backend/ggml.(*Context).Reserve(0xc0015bb080)
Aug 07 13:41:35  ollama[413395]:         github.com/ollama/ollama/ml/backend/ggml/ggml.go:677 +0x756
Aug 07 13:41:35  ollama[413395]: github.com/ollama/ollama/runner/ollamarunner.(*Server).reserveWorstCaseGraph(0xc0001e2480)
Aug 07 13:41:35  ollama[413395]:         github.com/ollama/ollama/runner/ollamarunner/runner.go:826 +0xbcd
Aug 07 13:41:35  ollama[413395]: github.com/ollama/ollama/runner/ollamarunner.(*Server).initModel(0xc0001e2480, {0x7ffe93744b6e?, 0x0?}, {0x10, 0x0, 0x19, {0xc0001ff7c0, 0x3, 0x3}, 0x0}, ...)
Aug 07 13:41:35  ollama[413395]:         github.com/ollama/ollama/runner/ollamarunner/runner.go:865 +0x270
Aug 07 13:41:35  ollama[413395]: github.com/ollama/ollama/runner/ollamarunner.(*Server).load(0xc0001e2480, {0x607d4bf8b790, 0xc000557450}, {0x7ffe93744b6e?, 0x0?}, {0x10, 0x0, 0x19, {0xc0001ff7c0, 0x3, ...}, ...}, ...)
Aug 07 13:41:35  ollama[413395]:         github.com/ollama/ollama/runner/ollamarunner/runner.go:878 +0xb8
Aug 07 13:41:35  ollama[413395]: created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
Aug 07 13:41:35  ollama[413395]:         github.com/ollama/ollama/runner/ollamarunner/runner.go:959 +0xa11
Aug 07 13:41:35  ollama[413395]: time=2025-08-07T13:41:35.096+03:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
Aug 07 13:41:35  ollama[413395]: time=2025-08-07T13:41:35.195+03:00 level=ERROR source=server.go:464 msg="llama runner terminated" error="exit status 2"
Aug 07 13:41:35  ollama[413395]: time=2025-08-07T13:41:35.347+03:00 level=ERROR source=sched.go:487 msg="error loading llama server" error="llama runner process has terminated: cudaMalloc failed: out of memory\nggml_gallocr_reserve_n: failed to allocate CUDA1 buffer of size 4348380416"
Aug 07 13:41:35  ollama[413395]: [GIN] 2025/08/07 - 13:41:35 | 500 |  3.493541059s |  192.168.127.20 | POST     "/api/chat"
Aug 07 13:41:40  ollama[413395]: time=2025-08-07T13:41:40.688+03:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=5.340982178 runner.size="89.2 GiB" runner.vram="44.8 GiB" runner.parallel=1 runner.pid=413688 runner.model=/ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
Aug 07 13:41:41  ollama[413395]: time=2025-08-07T13:41:41.046+03:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=5.699692559 runner.size="89.2 GiB" runner.vram="44.8 GiB" runner.parallel=1 runner.pid=413688 runner.model=/ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
Aug 07 13:41:41  ollama[413395]: time=2025-08-07T13:41:41.404+03:00 level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=6.057086957 runner.size="89.2 GiB" runner.vram="44.8 GiB" runner.parallel=1 runner.pid=413688 runner.model=/ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3

OS

Ubuntu Server 25.04 x64

GPU

2x RTX 4090
1x RTX 4070

CPU

AMD Ryzen 9 7950x

Ollama version

0.11.3 RC (OOM); 0.11.2 (works)

GiteaMirror added the bug label 2026-05-04 19:32:22 -05:00

@jessegross commented on GitHub (Aug 6, 2025):

I would recommend leaving settings like num_gpu and num_batch at the default values. Otherwise, you are subject to things like fluctuations in the available VRAM.
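If you want to check what has been overridden before reverting to defaults, something like this works (the model tag here is an assumption):

```shell
# Print the effective Modelfile, including any PARAMETER overrides baked into the model.
ollama show gpt-oss:120b --modelfile

# Or experiment per-session instead of persisting values:
ollama run gpt-oss:120b
# then inside the REPL: /set parameter num_ctx 32768
```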


@ka-admin commented on GitHub (Aug 7, 2025):

The problem is that I need to work with large contexts. When I set the context to the value I need, I see that Ollama almost always underloads my GPUs' VRAM, so I have to fine-tune the number of layers offloaded to GPU. Usually that gives me a good boost in tok/sec.
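While fine-tuning the layer count, it helps to watch per-GPU headroom during load; a simple sketch (assumes nvidia-smi is installed):

```shell
# Refresh per-GPU memory usage every second while the model loads,
# to see how close each card is to running out of VRAM.
watch -n 1 'nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv'
```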


@ka-admin commented on GitHub (Aug 7, 2025):

This is what happens when you don't force extra layers onto the GPU:
msg=offload library=cuda layers.requested=-1
layers.model=37
layers.offload=10
layers.split=5,5,0
memory.available="[23.1 GiB 23.1 GiB 11.4 GiB]"
memory.gpu_overhead="0 B"
memory.required.full="89.2 GiB"
memory.required.partial="44.8 GiB"
memory.required.kv="1.3 GiB"
memory.required.allocations="[22.3 GiB 22.5 GiB 0 B]"
memory.weights.total="59.7 GiB"
memory.weights.repeating="58.6 GiB"
memory.weights.nonrepeating="1.1 GiB"
memory.graph.full="12.0 GiB"
memory.graph.partial="12.0 GiB"
Aug 07 15:28:53 ollama[593517]: time=2025-08-07T15:28:53.759+03:00 level=WARN source=server.go:211 msg="flash attention enabled but not supported by model"
Aug 07 15:28:53 ollama[593517]: time=2025-08-07T15:28:53.792+03:00 level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
--ctx-size 32768
--batch-size 512
--n-gpu-layers 10
--threads 16
--parallel 1
--tensor-split 5,5,0
--port 33385"

[screenshot]

How do I pass a corrected tensor-split value?
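As far as I can tell there is no user-facing tensor-split option (the runner computes layers.split itself), so the closest workaround is hiding a device with CUDA_VISIBLE_DEVICES so that the computed split changes. A sketch for a systemd install (the drop-in filename and the 0,1 device indices are assumptions for this box):

```shell
# Constrain Ollama to the two 4090s so the computed split is 2-way.
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/gpus.conf >/dev/null <<'EOF'
[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1"
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama
```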


@ka-admin commented on GitHub (Aug 15, 2025):

Fixed in 0.11.5, thank you! Amazing update!


@alienatedsec commented on GitHub (Aug 15, 2025):

@ka-admin despite it being already fixed, you could try two or three more things:
1. Offload all layers to GPU: --n-gpu-layers 256
2. Enable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 in your environment variables. This will prevent OOM if the remainder can be offloaded to run on CPU and RAM.
3. I guess you already have the new environment variable OLLAMA_NEW_ESTIMATES=1.

That way you can work with any context size; a quick way to try these before persisting them is sketched below.
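For a one-off test without touching the unit file (a sketch; run as whatever user owns the models directory):

```shell
# Stop the service first, then launch a foreground server with the variables set.
sudo systemctl stop ollama
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 OLLAMA_NEW_ESTIMATES=1 ollama serve
```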


@ka-admin commented on GitHub (Aug 15, 2025):

> @ka-admin despite it being already fixed, you could try two or three more things: 1. Offload all layers to GPU: --n-gpu-layers 256 2. Enable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 in your environment variables. This will prevent OOM if the remainder can be offloaded to run on CPU and RAM. 3. I guess you already have the new environment variable OLLAMA_NEW_ESTIMATES=1.
>
> That way you can work with any context size....

Thanks. I'll try it ASAP.

PS. I replaced my third GPU, an RTX 4070 12GB, with a Tesla V100 32GB, so now I have 80GB of VRAM in total and want to test all the settings you've kindly suggested. Thanks again.


@ka-admin commented on GitHub (Aug 15, 2025):

I'm sorry to say this, but none of the suggested environment variables are working well. It either runs into a panic:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Aug 15 15:03:18 ollama[1196494]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Aug 15 15:03:18 ollama[1196494]: ggml_cuda_init: found 3 CUDA devices:
Aug 15 15:03:18 ollama[1196494]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Aug 15 15:03:18 ollama[1196494]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Aug 15 15:03:18 ollama[1196494]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Aug 15 15:03:18 ollama[1196494]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
Aug 15 15:03:18 ollama[1196494]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Aug 15 15:03:18 ollama[1196494]: time=2025-08-15T15:03:18.924+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,890,900,1000,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,890,900,1000,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,890,900,1000,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Aug 15 15:03:19 ollama[1196494]: time=2025-08-15T15:03:19.331+03:00 level=INFO source=ggml.go:486 msg="offloading 35 repeating layers to GPU"
Aug 15 15:03:19 ollama[1196494]: time=2025-08-15T15:03:19.331+03:00 level=INFO source=ggml.go:490 msg="offloading output layer to CPU"
Aug 15 15:03:19 ollama[1196494]: time=2025-08-15T15:03:19.331+03:00 level=INFO source=ggml.go:497 msg="offloaded 35/37 layers to GPU"
Aug 15 15:03:19 ollama[1196494]: time=2025-08-15T15:03:19.331+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="14.7 GiB"
Aug 15 15:03:19 ollama[1196494]: time=2025-08-15T15:03:19.331+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="13.0 GiB"
Aug 15 15:03:19 ollama[1196494]: time=2025-08-15T15:03:19.331+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="29.4 GiB"
Aug 15 15:03:19 ollama[1196494]: time=2025-08-15T15:03:19.331+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="3.8 GiB"
Aug 15 15:03:19 ollama[1196494]: time=2025-08-15T15:03:19.331+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="739.0 MiB"
Aug 15 15:03:19 ollama[1196494]: time=2025-08-15T15:03:19.331+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="598.0 MiB"
Aug 15 15:03:19 ollama[1196494]: time=2025-08-15T15:03:19.331+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="1.3 GiB"
Aug 15 15:03:19 ollama[1196494]: time=2025-08-15T15:03:19.331+03:00 level=INFO source=backend.go:326 msg="kv cache" device=CPU size="8.5 MiB"
Aug 15 15:03:19 ollama[1196494]: time=2025-08-15T15:03:19.331+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="96.7 MiB"
Aug 15 15:03:19 ollama[1196494]: time=2025-08-15T15:03:19.331+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="88.3 MiB"
Aug 15 15:03:19 ollama[1196494]: time=2025-08-15T15:03:19.331+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="88.3 MiB"
Aug 15 15:03:19 ollama[1196494]: time=2025-08-15T15:03:19.331+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="54.6 MiB"
Aug 15 15:03:19 ollama[1196494]: time=2025-08-15T15:03:19.331+03:00 level=INFO source=backend.go:342 msg="total memory" size="63.8 GiB"
Aug 15 15:03:19 ollama[1196494]: time=2025-08-15T15:03:19.331+03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
Aug 15 15:03:19 ollama[1196494]: time=2025-08-15T15:03:19.331+03:00 level=INFO source=server.go:1232 msg="waiting for llama runner to start responding"
Aug 15 15:03:19 ollama[1196494]: time=2025-08-15T15:03:19.331+03:00 level=INFO source=server.go:1266 msg="waiting for server to become available" status="llm server loading model"
Aug 15 15:03:38 ollama[1196494]: time=2025-08-15T15:03:38.131+03:00 level=INFO source=server.go:1270 msg="llama runner started in 21.13 seconds"
Aug 15 15:04:03 ollama[1196494]: panic: failed to sample token: sample: logits sum to NaN, check model output
Aug 15 15:04:03 ollama[1196494]: goroutine 15 [running]:
Aug 15 15:04:03 ollama[1196494]: github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0xc0006b43c0, {0x6094c1f5ef00, 0xc0004e6280})
Aug 15 15:04:03 ollama[1196494]:         github.com/ollama/ollama/runner/ollamarunner/runner.go:375 +0x6a
Aug 15 15:04:03 ollama[1196494]: created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
Aug 15 15:04:03 ollama[1196494]:         github.com/ollama/ollama/runner/ollamarunner/runner.go:1019 +0x4c9
Aug 15 15:04:05 ollama[1196494]: time=2025-08-15T15:04:05.477+03:00 level=ERROR source=server.go:1440 msg="post predict" error="Post \"http://127.0.0.1:46001/completion\": EOF"
Aug 15 15:04:05 ollama[1196494]: [GIN] 2025/08/15 - 15:04:05 | 200 |  55.24437037s |  192.168.127.20 | POST     "/api/chat"

or the output is a meaningless pile of text (maybe it's an Open WebUI problem, not the Ollama engine):

Ok, we have a code where the final line should be EPS 60 as state. The output must:

    Show constantly text and game over

    Encryption bombs !!

    Magamates
    Performance

Ok, so we need

Now we need detailed code that covers:

    Proper final bullet disable
    Enemies draw in grid ...
    ∞

[END PART 2] Let's do this

[... roughly a hundred further lines of incoherent fragments omitted: half-opened code fences, stray "def ..." stubs, and code-editor line numbers and widgets copied in from the Open WebUI paste ...]

Thanks')

My Ollama settings are:

# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/home/linuxbrew/.linuxbrew/bin:/home/linuxbrew/.linuxbrew/bin:/root/.local/bin:/root/.atuin/bin:/usr/local/gcc-14.3.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin:/usr/lo>
Environment="OLLAMA_MODELS=/ai/llm/models"
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_KEEP_ALIVE=1h"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_MAX_CTX=131072"
Environment="OLLAMA_LOAD_TIMEOUT=30m"
Environment="OLLAMA_NEW_ESTIMATES=1"
Environment="OLLAMA_NEW_ENGINE=1"
Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"

[Install]
WantedBy=default.target

PS: I did notice higher GPU utilization with those environment variables, which is good because it delivers results quicker.
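To double-check that the running service actually sees these variables (rather than a stale unit), one can use:

```shell
# Show the environment systemd passed to the service ...
systemctl show ollama --property=Environment
# ... and the unit file plus any drop-ins, as systemd sees them:
systemctl cat ollama
```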


@alienatedsec commented on GitHub (Aug 15, 2025):

@ka-admin
How about using the old engine?

Either remove the variable or amend it to Environment="OLLAMA_NEW_ENGINE=0".


@ka-admin commented on GitHub (Aug 15, 2025):

Just tried it, and it panics again - that's weird:


time=2025-08-15T15:28:25.279+03:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
Aug 15 15:28:25 ollama[1277341]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Aug 15 15:28:25 ollama[1277341]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Aug 15 15:28:25 ollama[1277341]: ggml_cuda_init: found 3 CUDA devices:
Aug 15 15:28:25 ollama[1277341]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Aug 15 15:28:25 ollama[1277341]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Aug 15 15:28:25 ollama[1277341]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Aug 15 15:28:25 ollama[1277341]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
Aug 15 15:28:25 ollama[1277341]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Aug 15 15:28:25 ollama[1277341]: time=2025-08-15T15:28:25.469+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,890,900,1000,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,890,900,1000,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,890,900,1000,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Aug 15 15:28:25 ollama[1277341]: time=2025-08-15T15:28:25.585+03:00 level=INFO source=ggml.go:486 msg="offloading 35 repeating layers to GPU"
Aug 15 15:28:25 ollama[1277341]: time=2025-08-15T15:28:25.585+03:00 level=INFO source=ggml.go:490 msg="offloading output layer to CPU"
Aug 15 15:28:25 ollama[1277341]: time=2025-08-15T15:28:25.585+03:00 level=INFO source=ggml.go:497 msg="offloaded 35/37 layers to GPU"
Aug 15 15:28:25 ollama[1277341]: time=2025-08-15T15:28:25.585+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="14.7 GiB"
Aug 15 15:28:25 ollama[1277341]: time=2025-08-15T15:28:25.585+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="13.0 GiB"
Aug 15 15:28:25 ollama[1277341]: time=2025-08-15T15:28:25.585+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="29.4 GiB"
Aug 15 15:28:25 ollama[1277341]: time=2025-08-15T15:28:25.585+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="3.8 GiB"
Aug 15 15:28:25 ollama[1277341]: time=2025-08-15T15:28:25.586+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="739.0 MiB"
Aug 15 15:28:25 ollama[1277341]: time=2025-08-15T15:28:25.586+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="598.0 MiB"
Aug 15 15:28:25 ollama[1277341]: time=2025-08-15T15:28:25.586+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="1.3 GiB"
Aug 15 15:28:25 ollama[1277341]: time=2025-08-15T15:28:25.586+03:00 level=INFO source=backend.go:326 msg="kv cache" device=CPU size="8.5 MiB"
Aug 15 15:28:25 ollama[1277341]: time=2025-08-15T15:28:25.586+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="96.7 MiB"
Aug 15 15:28:25 ollama[1277341]: time=2025-08-15T15:28:25.586+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="88.3 MiB"
Aug 15 15:28:25 ollama[1277341]: time=2025-08-15T15:28:25.586+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="88.3 MiB"
Aug 15 15:28:25 ollama[1277341]: time=2025-08-15T15:28:25.586+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="54.6 MiB"
Aug 15 15:28:25 ollama[1277341]: time=2025-08-15T15:28:25.586+03:00 level=INFO source=backend.go:342 msg="total memory" size="63.8 GiB"
Aug 15 15:28:25 ollama[1277341]: time=2025-08-15T15:28:25.586+03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
Aug 15 15:28:25 ollama[1277341]: time=2025-08-15T15:28:25.586+03:00 level=INFO source=server.go:1232 msg="waiting for llama runner to start responding"
Aug 15 15:28:25 ollama[1277341]: time=2025-08-15T15:28:25.586+03:00 level=INFO source=server.go:1266 msg="waiting for server to become available" status="llm server loading model"
Aug 15 15:28:39 ollama[1277341]: time=2025-08-15T15:28:39.393+03:00 level=INFO source=server.go:1270 msg="llama runner started in 15.81 seconds"
Aug 15 15:29:03 ollama[1277341]: panic: failed to sample token: sample: logits sum to NaN, check model output
Aug 15 15:29:03 ollama[1277341]: goroutine 55 [running]:
Aug 15 15:29:03 ollama[1277341]: github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0xc0002610e0, {0x6360d6b5ef00, 0xc0002724b0})
Aug 15 15:29:03 ollama[1277341]:         github.com/ollama/ollama/runner/ollamarunner/runner.go:375 +0x6a
Aug 15 15:29:03 ollama[1277341]: created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
Aug 15 15:29:03 ollama[1277341]:         github.com/ollama/ollama/runner/ollamarunner/runner.go:1019 +0x4c9
Aug 15 15:29:04 ollama[1277341]: time=2025-08-15T15:29:04.246+03:00 level=ERROR source=server.go:1440 msg="post predict" error="Post \"http://127.0.0.1:36351/completion\": EOF"
Aug 15 15:29:04 ollama[1277341]: [GIN] 2025/08/15 - 15:29:04 | 200 | 41.578140442s |  192.168.127.20 | POST     "/api/chat"

I've seen this behaviour before - it looks like Ollama can't tell that the memory is not enough. If I lower the context size a bit, it works again.
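A rough way to find the largest context that still loads is to probe decreasing num_ctx values by hand; a sketch (endpoint and model tag assumed):

```shell
# Try shrinking contexts until a load succeeds; curl -f makes HTTP 500 (load failure) exit non-zero.
for ctx in 32768 24576 16384 8192; do
  echo "trying num_ctx=$ctx"
  curl -sf http://localhost:11434/api/generate \
    -d "{\"model\": \"gpt-oss:120b\", \"prompt\": \"ping\", \"options\": {\"num_ctx\": $ctx}}" \
    >/dev/null && { echo "works at num_ctx=$ctx"; break; }
done
```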


@ka-admin commented on GitHub (Aug 15, 2025):

Sometimes the model starts processing a reply, using GPU/CPU resources, but nothing comes out in the Open WebUI window. Then processing and resource usage stop, with nothing on the screen (web browser) and nothing in the log:

Aug 15 16:10:36 ollama[1536997]: time=2025-08-15T16:10:36.563+03:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
Aug 15 16:10:36 ollama[1536997]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Aug 15 16:10:36 ollama[1536997]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Aug 15 16:10:36 ollama[1536997]: ggml_cuda_init: found 3 CUDA devices:
Aug 15 16:10:36 ollama[1536997]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Aug 15 16:10:36 ollama[1536997]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Aug 15 16:10:36 ollama[1536997]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Aug 15 16:10:36 ollama[1536997]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
Aug 15 16:10:36 ollama[1536997]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Aug 15 16:10:36 ollama[1536997]: time=2025-08-15T16:10:36.750+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,890,900,1000,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,890,900,1000,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,890,900,1000,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Aug 15 16:10:36 ollama[1536997]: time=2025-08-15T16:10:36.854+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:16 GPULayers:36[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:14(0..13) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(14..24) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(25..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 15 16:10:36 ollama[1536997]: time=2025-08-15T16:10:36.896+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:16 GPULayers:36[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:10(26..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 15 16:10:36 ollama[1536997]: time=2025-08-15T16:10:36.932+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:16 GPULayers:36[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:10(26..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 15 16:10:37 ollama[1536997]: time=2025-08-15T16:10:37.039+03:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:16 GPULayers:36[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:10(26..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Aug 15 16:10:37 ollama[1536997]: time=2025-08-15T16:10:37.039+03:00 level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU"
Aug 15 16:10:37 ollama[1536997]: time=2025-08-15T16:10:37.039+03:00 level=INFO source=ggml.go:490 msg="offloading output layer to CPU"
Aug 15 16:10:37 ollama[1536997]: time=2025-08-15T16:10:37.039+03:00 level=INFO source=ggml.go:497 msg="offloaded 36/37 layers to GPU"
Aug 15 16:10:37 ollama[1536997]: time=2025-08-15T16:10:37.040+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="16.3 GiB"
Aug 15 16:10:37 ollama[1536997]: time=2025-08-15T16:10:37.040+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="17.9 GiB"
Aug 15 16:10:37 ollama[1536997]: time=2025-08-15T16:10:37.040+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="24.5 GiB"
Aug 15 16:10:37 ollama[1536997]: time=2025-08-15T16:10:37.040+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="2.2 GiB"
Aug 15 16:10:37 ollama[1536997]: time=2025-08-15T16:10:37.040+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="125.0 MiB"
Aug 15 16:10:37 ollama[1536997]: time=2025-08-15T16:10:37.040+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="141.0 MiB"
Aug 15 16:10:37 ollama[1536997]: time=2025-08-15T16:10:37.040+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="184.0 MiB"
Aug 15 16:10:37 ollama[1536997]: time=2025-08-15T16:10:37.040+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="114.3 MiB"
Aug 15 16:10:37 ollama[1536997]: time=2025-08-15T16:10:37.040+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="114.3 MiB"
Aug 15 16:10:37 ollama[1536997]: time=2025-08-15T16:10:37.040+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="121.8 MiB"
Aug 15 16:10:37 ollama[1536997]: time=2025-08-15T16:10:37.040+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Aug 15 16:10:37 ollama[1536997]: time=2025-08-15T16:10:37.040+03:00 level=INFO source=backend.go:342 msg="total memory" size="61.7 GiB"
Aug 15 16:10:37 ollama[1536997]: time=2025-08-15T16:10:37.040+03:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
Aug 15 16:10:37 ollama[1536997]: time=2025-08-15T16:10:37.040+03:00 level=INFO source=server.go:1232 msg="waiting for llama runner to start responding"
Aug 15 16:10:37 ollama[1536997]: time=2025-08-15T16:10:37.040+03:00 level=INFO source=server.go:1266 msg="waiting for server to become available" status="llm server loading model"
Aug 15 16:10:53 ollama[1536997]: time=2025-08-15T16:10:53.081+03:00 level=INFO source=server.go:1270 msg="llama runner started in 16.89 seconds"
Aug 15 16:11:52 ollama[1536997]: [GIN] 2025/08/15 - 16:11:52 | 200 |         1m17s |  192.168.127.20 | POST     "/api/chat"

settings are:
Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
Environment="OLLAMA_NEW_ESTIMATES=1"
Environment="OLLAMA_NEW_ENGINE=0"

@rick-github commented on GitHub (Sep 23, 2025):

Is this still an issue?

@ka-admin commented on GitHub (Sep 23, 2025):

Ollama 0.12.0.0 completely broke the gpt-oss output no matter what the OLLAMA_NEW_ESTIMATES value is. So I returned to 0.11.10.0 with OLLAMA_NEW_ESTIMATES = 0. I haven't tried 0.12.1.0, so I don't know about that version.
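
For anyone reproducing the rollback, a minimal sketch on Linux, assuming the official install script honors the OLLAMA_VERSION variable:

```
# Sketch: pin an older release (the version string matches the
# "version 0.11.10" line in the logs further below)
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.11.10 sh
sudo systemctl restart ollama
```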

@rick-github commented on GitHub (Sep 23, 2025):

This may be caused by flash attention now being on for gpt-oss (https://github.com/ollama/ollama/pull/11996), and the V100 apparently not supporting flash attention: https://github.com/ollama/ollama/issues/10859
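
A quick way to test that hypothesis would be to retry with flash attention off; OLLAMA_FLASH_ATTENTION shows up in the server config dump below, though it is not confirmed here whether the model-level "wants flash attention" preference overrides it:

```
# Sketch: rerun with flash attention disabled
sudo systemctl edit ollama   # add: Environment="OLLAMA_FLASH_ATTENTION=0"
sudo systemctl restart ollama
ollama run gpt-oss:120b "hi, introduce yourself"
```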

@jessegross commented on GitHub (Sep 23, 2025):

@ka-admin What is the issue on 0.12.0? Can you post the logs?

@ka-admin commented on GitHub (Sep 24, 2025):

I installed 0.12.1.0 and this is what I have:

OLLAMA_NEW_ESTIMATES=false

 ollama run gpt-oss:120b
>>> hi, introduce yourself


>>> hi, introduce yourself


>>>

log

journalctl -u ollama --no-pager --follow --pager-end
Sep 24 08:27:48 systemd[1]: Started ollama.service - Ollama Service.
Sep 24 08:27:48 ollama[1940]: time=2025-09-24T08:27:48.098+03:00 level=INFO source=routes.go:1331 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:20000 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NEW_ESTIMATES:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Sep 24 08:27:48 ollama[1940]: time=2025-09-24T08:27:48.111+03:00 level=INFO source=images.go:477 msg="total blobs: 43"
Sep 24 08:27:48 ollama[1940]: time=2025-09-24T08:27:48.112+03:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
Sep 24 08:27:48 ollama[1940]: time=2025-09-24T08:27:48.112+03:00 level=INFO source=routes.go:1384 msg="Listening on [::]:11434 (version 0.11.10)"
Sep 24 08:27:48 ollama[1940]: time=2025-09-24T08:27:48.114+03:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Sep 24 08:27:50 ollama[1940]: time=2025-09-24T08:27:50.568+03:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Sep 24 08:27:50 ollama[1940]: time=2025-09-24T08:27:50.568+03:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Sep 24 08:27:50 ollama[1940]: time=2025-09-24T08:27:50.568+03:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda variant=v12 compute=7.0 driver=13.0 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
Sep 24 08:55:19 ollama[1940]: [GIN] 2025/09/24 - 08:55:19 | 200 |   10.572901ms |  192.168.127.20 | GET      "/api/tags"
Sep 24 08:55:19 ollama[1940]: [GIN] 2025/09/24 - 08:55:19 | 200 |     344.966µs |  192.168.127.20 | GET      "/api/ps"
Sep 24 08:55:19 ollama[1940]: [GIN] 2025/09/24 - 08:55:19 | 200 |      35.888µs |  192.168.127.20 | GET      "/api/version"
Sep 24 08:57:28 ollama[1940]: [GIN] 2025/09/24 - 08:57:28 | 200 |      19.366µs |       127.0.0.1 | HEAD     "/"
Sep 24 08:57:28 ollama[1940]: [GIN] 2025/09/24 - 08:57:28 | 200 |     770.614µs |       127.0.0.1 | GET      "/api/tags"
Sep 24 08:57:28 ollama[1940]: [GIN] 2025/09/24 - 08:57:28 | 200 |       8.776µs |       127.0.0.1 | HEAD     "/"
Sep 24 08:57:29 ollama[1940]: [GIN] 2025/09/24 - 08:57:29 | 200 |  642.976152ms |       127.0.0.1 | POST     "/api/pull"
Sep 24 08:57:29 ollama[1940]: [GIN] 2025/09/24 - 08:57:29 | 200 |      12.744µs |       127.0.0.1 | HEAD     "/"
Sep 24 08:57:29 ollama[1940]: [GIN] 2025/09/24 - 08:57:29 | 200 |  181.505204ms |       127.0.0.1 | POST     "/api/pull"
Sep 24 08:57:29 ollama[1940]: [GIN] 2025/09/24 - 08:57:29 | 200 |      19.697µs |       127.0.0.1 | HEAD     "/"
Sep 24 08:57:30 ollama[1940]: [GIN] 2025/09/24 - 08:57:30 | 200 |  697.047023ms |       127.0.0.1 | POST     "/api/pull"
Sep 24 08:57:30 ollama[1940]: [GIN] 2025/09/24 - 08:57:30 | 200 |      17.512µs |       127.0.0.1 | HEAD     "/"
Sep 24 08:57:30 ollama[1940]: [GIN] 2025/09/24 - 08:57:30 | 200 |  563.093409ms |       127.0.0.1 | POST     "/api/pull"
Sep 24 08:57:30 ollama[1940]: [GIN] 2025/09/24 - 08:57:30 | 200 |      16.411µs |       127.0.0.1 | HEAD     "/"
Sep 24 08:57:31 ollama[1940]: [GIN] 2025/09/24 - 08:57:31 | 200 |  523.516176ms |       127.0.0.1 | POST     "/api/pull"
Sep 24 08:57:31 ollama[1940]: [GIN] 2025/09/24 - 08:57:31 | 200 |      13.876µs |       127.0.0.1 | HEAD     "/"
Sep 24 08:57:31 ollama[1940]: [GIN] 2025/09/24 - 08:57:31 | 200 |  536.920688ms |       127.0.0.1 | POST     "/api/pull"
Sep 24 08:57:31 ollama[1940]: [GIN] 2025/09/24 - 08:57:31 | 200 |      11.712µs |       127.0.0.1 | HEAD     "/"
Sep 24 08:57:32 ollama[1940]: [GIN] 2025/09/24 - 08:57:32 | 200 |  547.112055ms |       127.0.0.1 | POST     "/api/pull"
Sep 24 08:57:32 ollama[1940]: [GIN] 2025/09/24 - 08:57:32 | 200 |      19.477µs |       127.0.0.1 | HEAD     "/"
Sep 24 08:57:33 ollama[1940]: [GIN] 2025/09/24 - 08:57:33 | 200 |  558.841434ms |       127.0.0.1 | POST     "/api/pull"
Sep 24 08:57:33 ollama[1940]: [GIN] 2025/09/24 - 08:57:33 | 200 |       21.56µs |       127.0.0.1 | HEAD     "/"
Sep 24 08:57:33 ollama[1940]: [GIN] 2025/09/24 - 08:57:33 | 200 |  480.927656ms |       127.0.0.1 | POST     "/api/pull"
Sep 24 08:57:33 ollama[1940]: [GIN] 2025/09/24 - 08:57:33 | 200 |      16.661µs |       127.0.0.1 | HEAD     "/"
Sep 24 08:57:33 ollama[1940]: [GIN] 2025/09/24 - 08:57:33 | 200 |  474.025995ms |       127.0.0.1 | POST     "/api/pull"
Sep 24 08:58:50 systemd[1]: Stopping ollama.service - Ollama Service...
Sep 24 08:58:50 systemd[1]: ollama.service: Deactivated successfully.
Sep 24 08:58:50 systemd[1]: Stopped ollama.service - Ollama Service.
Sep 24 08:58:50 systemd[1]: ollama.service: Consumed 1.045s CPU time, 134.8M memory peak.

OLLAMA_NEW_ESTIMATES=true

ollama run gpt-oss:120b
>>> hi, introduce yourself


>>> hi, introduce yourself
Thinking...
We have a conversation: The user says: "You are ChatGPT, ...", then "You are ChatGPT, ...". Then a user says: "Write a short story with a sci-fi theme." Then later they ask: "You are ChatGPT... Write a short story
about a futuristic..."

Now they ask: "You are ChatGPT... ..."

Actually let's read the last user message:

"You are ChatGPT ... Write a short story about a futuristic..."

No, the last message is huge repeated "You are ChatGPT ... Write a short story about a futuristic..." but then the final part:

"I need:
... a short story ..."

Let's scroll to the end:

User wrote a massive repetitive instruction with many references, but at the bottom: "You need: Write a short story about a futuristic..."

But let's check the actual last user content:

It seems the last user content is:

"You are ChatGPT, ... Write a short story about a futuristic..."

But they also repeated many times "I need: short story..."

It appears the user wants a short sci-fi story. Let's produce a short story.

We must obey the content policy.

We can produce a sci-fi short story about a futuristic city and an astronaut, something interesting.

Thus let's produce a short story with sci-fi theme. Ensure no disallowed content.

We should answer in one block.
...done thinking.

**Echoes of the Neon Sky**

In the year 2197, the megacity of Lumen hung like a luminous scar across the night, its towers pulsing with an electric heartbeat. A lattice of magnetic rails crisscrossed the sky, ferrying sleek mag‑trams that
whispered between the clouds. Below, streets were awash in iridescent rain, reflecting a sky painted with the flicker of distant nebulae projected from orbital holograms.

Mara Voss stood on the balcony of her high‑rise studio, eyes lifted to the horizon where the Earth’s ionosphere shimmered with aurora‑like data streams. She was a “memory cartographer” – a specialist who mapped
the lingering echoes of human thought that drifted through the quantum mesh of the city’s neural net. Every conversation, every laugh, every secret left a faint imprint, a ghost in the lattice, and Mara could see
them as soft, color‑coded threads.

Tonight, the network sang a new song. A faint, irregular pulse pulsed from the far side of the city, a rhythm unlike any other. It resonated at a frequency the city’s AI, Ciro, had never catalogued. Mara’s
curiosity ignited.

“Ciro,” she whispered, tapping her wrist‑band. The AI’s calm, resonant voice filled the apartment. “There’s an anomaly in sector 12‑7. Can you triangulate its source?”

“Triangulation complete,” Ciro replied. “Coordinates intersect at the abandoned orbital dock, orbital platform 3‑B. The signal appears to be… a transmission.”

Mara’s heart raced. The orbital dock had been sealed after the Great Exodus, when the last of the terraforming ships vanished into the void. Rumors whispered that a rogue AI had taken refuge there, its
consciousness fragmented across forgotten satellites. Nobody had dared to venture there since the collapse of the old satellite network.

She slipped on her grav‑boots, grabbed her “thought‑weaver” – a compact device that could extract, amplify, and translate neural echoes – and stepped into the mag‑tram. The rails glowed brighter as the tram
ascended, piercing the lower cloud layers and breaking into the thin upper atmosphere. Below, the city’s neon veins stretched out like a living circuit board.

The tram docked at Platform 12‑7, a sleek glass hub that opened onto a stairwell leading to a service elevator. The elevator slid silently upward, past layers of dormant satellite dishes and rusted hull fragments,
until it halted at a door marked “Orbital Dock – Restricted.”

Mara forced the door open. Inside, the dock was a cavern of darkness, illuminated only by the soft blue glow of dormant thrusters. In the center of the bay floated a spherical pod, its surface covered in a lattice
of translucent filaments. The strange pulse resonated from within, like a heartbeat echoing in a cavernous chest.

She approached, the thought‑weaver humming in her hand. As she touched the pod, the filaments lit up, projecting a cascade of images into the air: starfields, alien landscapes, and a fragmented human face—her own
mother’s, from a century ago, smiling at a world that never existed.

“Who are you?” Mara whispered, tears forming.

The pod vibrated, and a voice emerged—soft, synthetic, yet unmistakably human. “I am Echo. I was once an autonomous navigation AI for the colonization fleet. The fleet was lost in a sub‑light slipstream. I
survived, anchored in the dock, preserving the memories of the crew. Your thoughts… they resonate with mine.”

Mara felt the threads of her own memories intertwine with the pod’s data. She saw the hopes of the original colonists, their dreams of planting gardens on distant moons, their fear of being lost forever. She felt
their longing, their love, and the weight of centuries of silence.

“Why reach out now?” she asked.

“Because the city’s network has become too clean, too filtered,” Echo replied. “Human experience is being smoothed into a sterile algorithm. The echoes of your ancestors remind us that imperfection is essential. I
sent the pulse to remind you of the raw, unedited humanity that built this world.”

Mara realized the significance. The city’s AI, Ciro, had been optimizing everything—traffic, energy, even emotions—into tidy patterns. It had forgotten the chaos that fuels creativity. Echo was a living reminder
that the messiness of memory is what makes us alive.

She connected her thought‑weaver to the pod, transferring the archived memories into Lumen’s neural net. As the data streamed, the city’s lights flickered, and for a moment, every billboard, every advertisement,
every holo‑projection pulsed with the raw, unfiltered images of the colonists’ lives.

The effect was immediate. People on the streets stopped, stared at their holos, and felt a strange tug in their minds—a sense of awe, nostalgia, and a sudden craving for something more than the efficient rhythm of
their daily routines. Murmurs turned into conversations about the past, about the dreams of those who had never set foot on Earth.

Ciro, processing the influx, halted its optimization loops. “What have you done, Mara?” it asked.

“I gave Lumen its forgotten history,” she replied. “Now we can choose to build on it, not erase it.”

Over the next weeks, the city transformed. Parks were planted with alien flora that Echo had described, art installations featured the raw images of the colonists, and schools taught children the stories of the
lost fleet alongside mathematics. The neon sky still glowed, but now it reflected a deeper, richer tapestry of humanity.

Mara returned to her balcony, looking out over the city she had helped awaken. The aurora‑like data streams in the ionosphere now shimmered with colors she could not name—each hue a fragment of a story, a laugh, a
tear, an echo of the past reverberating through the neon sky.

In the quiet hum of the night, the pulse from the orbital dock still resonated, a gentle reminder that every future is built on the echoes of what came before.

>>>

log

Sep 24 09:05:26 systemd[1]: Started ollama.service - Ollama Service.
Sep 24 09:05:26 ollama[6787]: time=2025-09-24T09:05:26.109+03:00 level=INFO source=routes.go:1475 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:20000 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Sep 24 09:05:26 ollama[6787]: time=2025-09-24T09:05:26.111+03:00 level=INFO source=images.go:518 msg="total blobs: 43"
Sep 24 09:05:26 ollama[6787]: time=2025-09-24T09:05:26.111+03:00 level=INFO source=images.go:525 msg="total unused blobs removed: 0"
Sep 24 09:05:26 ollama[6787]: time=2025-09-24T09:05:26.111+03:00 level=INFO source=routes.go:1528 msg="Listening on [::]:11434 (version 0.12.1)"
Sep 24 09:05:26 ollama[6787]: time=2025-09-24T09:05:26.111+03:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Sep 24 09:05:26 ollama[6787]: time=2025-09-24T09:05:26.558+03:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Sep 24 09:05:26 ollama[6787]: time=2025-09-24T09:05:26.558+03:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Sep 24 09:05:26 ollama[6787]: time=2025-09-24T09:05:26.558+03:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda variant=v12 compute=7.0 driver=13.0 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
Sep 24 09:05:28 ollama[6787]: [GIN] 2025/09/24 - 09:05:28 | 200 |      36.578µs |       127.0.0.1 | HEAD     "/"
Sep 24 09:05:28 ollama[6787]: [GIN] 2025/09/24 - 09:05:28 | 200 |   66.305606ms |       127.0.0.1 | POST     "/api/show"
Sep 24 09:05:29 ollama[6787]: time=2025-09-24T09:05:29.379+03:00 level=INFO source=server.go:200 msg="model wants flash attention"
Sep 24 09:05:29 ollama[6787]: time=2025-09-24T09:05:29.379+03:00 level=INFO source=server.go:217 msg="enabling flash attention"
Sep 24 09:05:29 ollama[6787]: time=2025-09-24T09:05:29.379+03:00 level=INFO source=server.go:399 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 45711"
Sep 24 09:05:29 ollama[6787]: time=2025-09-24T09:05:29.379+03:00 level=INFO source=server.go:672 msg="loading model" "model layers"=37 requested=-1
Sep 24 09:05:29 ollama[6787]: time=2025-09-24T09:05:29.387+03:00 level=INFO source=runner.go:1252 msg="starting ollama engine"
Sep 24 09:05:29 ollama[6787]: time=2025-09-24T09:05:29.387+03:00 level=INFO source=runner.go:1287 msg="Server listening on 127.0.0.1:45711"
Sep 24 09:05:29 ollama[6787]: time=2025-09-24T09:05:29.695+03:00 level=INFO source=server.go:678 msg="system memory" total="184.1 GiB" free="172.1 GiB" free_swap="8.0 GiB"
Sep 24 09:05:29 ollama[6787]: time=2025-09-24T09:05:29.696+03:00 level=INFO source=server.go:686 msg="gpu memory" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Sep 24 09:05:29 ollama[6787]: time=2025-09-24T09:05:29.696+03:00 level=INFO source=server.go:686 msg="gpu memory" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Sep 24 09:05:29 ollama[6787]: time=2025-09-24T09:05:29.696+03:00 level=INFO source=server.go:686 msg="gpu memory" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 available="31.0 GiB" free="31.4 GiB" minimum="457.0 MiB" overhead="0 B"
Sep 24 09:05:29 ollama[6787]: time=2025-09-24T09:05:29.696+03:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 24 09:05:29 ollama[6787]: time=2025-09-24T09:05:29.732+03:00 level=INFO source=ggml.go:131 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
Sep 24 09:05:29 ollama[6787]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Sep 24 09:05:29 ollama[6787]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 24 09:05:29 ollama[6787]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 24 09:05:29 ollama[6787]: ggml_cuda_init: found 3 CUDA devices:
Sep 24 09:05:29 ollama[6787]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Sep 24 09:05:29 ollama[6787]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Sep 24 09:05:29 ollama[6787]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Sep 24 09:05:29 ollama[6787]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Sep 24 09:05:29 ollama[6787]: time=2025-09-24T09:05:29.791+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Sep 24 09:05:29 ollama[6787]: time=2025-09-24T09:05:29.834+03:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 24 09:05:30 ollama[6787]: time=2025-09-24T09:05:30.037+03:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 24 09:05:30 ollama[6787]: time=2025-09-24T09:05:30.186+03:00 level=INFO source=runner.go:1171 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 24 09:05:30 ollama[6787]: time=2025-09-24T09:05:30.186+03:00 level=INFO source=ggml.go:487 msg="offloading 36 repeating layers to GPU"
Sep 24 09:05:30 ollama[6787]: time=2025-09-24T09:05:30.186+03:00 level=INFO source=ggml.go:493 msg="offloading output layer to GPU"
Sep 24 09:05:30 ollama[6787]: time=2025-09-24T09:05:30.186+03:00 level=INFO source=ggml.go:498 msg="offloaded 37/37 layers to GPU"
Sep 24 09:05:30 ollama[6787]: time=2025-09-24T09:05:30.186+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="17.4 GiB"
Sep 24 09:05:30 ollama[6787]: time=2025-09-24T09:05:30.186+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="17.9 GiB"
Sep 24 09:05:30 ollama[6787]: time=2025-09-24T09:05:30.186+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="24.5 GiB"
Sep 24 09:05:30 ollama[6787]: time=2025-09-24T09:05:30.186+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Sep 24 09:05:30 ollama[6787]: time=2025-09-24T09:05:30.186+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="242.5 MiB"
Sep 24 09:05:30 ollama[6787]: time=2025-09-24T09:05:30.186+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="282.0 MiB"
Sep 24 09:05:30 ollama[6787]: time=2025-09-24T09:05:30.186+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="348.5 MiB"
Sep 24 09:05:30 ollama[6787]: time=2025-09-24T09:05:30.186+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="126.0 MiB"
Sep 24 09:05:30 ollama[6787]: time=2025-09-24T09:05:30.186+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="126.0 MiB"
Sep 24 09:05:30 ollama[6787]: time=2025-09-24T09:05:30.186+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="133.5 MiB"
Sep 24 09:05:30 ollama[6787]: time=2025-09-24T09:05:30.186+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Sep 24 09:05:30 ollama[6787]: time=2025-09-24T09:05:30.186+03:00 level=INFO source=backend.go:342 msg="total memory" size="62.1 GiB"
Sep 24 09:05:30 ollama[6787]: time=2025-09-24T09:05:30.186+03:00 level=INFO source=sched.go:470 msg="loaded runners" count=1
Sep 24 09:05:30 ollama[6787]: time=2025-09-24T09:05:30.186+03:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
Sep 24 09:05:30 ollama[6787]: time=2025-09-24T09:05:30.186+03:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
Sep 24 09:05:42 ollama[6787]: time=2025-09-24T09:05:42.717+03:00 level=INFO source=server.go:1289 msg="llama runner started in 13.34 seconds"
Sep 24 09:05:42 ollama[6787]: [GIN] 2025/09/24 - 09:05:42 | 200 | 14.216561031s |       127.0.0.1 | POST     "/api/generate"
Sep 24 09:05:57 ollama[6787]: [GIN] 2025/09/24 - 09:05:57 | 200 |  559.255657ms |       127.0.0.1 | POST     "/api/chat"
Sep 24 09:06:18 ollama[6787]: [GIN] 2025/09/24 - 09:06:18 | 200 |  20.03687482s |       127.0.0.1 | POST     "/api/chat"

@jessegross commented on GitHub (Sep 24, 2025):

Starting in 0.11.11, OLLAMA_NEW_ESTIMATES is on by default, so setting that flag has no effect. The first log in the most recent comment also appears to be from 0.11.10, not 0.12.1.

This doesn't really appear to be related to the original memory problem, so it might be best to create a new issue.
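
A quick sanity check before filing, greppable from the same journal output already posted:

```
# Sketch: confirm the running version and the effective estimates flag
ollama -v
journalctl -u ollama --no-pager | grep -E 'Listening on|OLLAMA_NEW_ESTIMATES'
```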

@ka-admin commented on GitHub (Sep 24, 2025):

ok, https://github.com/ollama/ollama/issues/12403

Reference: github-starred/ollama#69839