[GH-ISSUE #12403] gpt-oss 120b garbage output since Ollama version 0.11.11/0.12.0 #8237

Closed
opened 2026-04-12 20:44:15 -05:00 by GiteaMirror · 16 comments

Originally created by @ka-admin on GitHub (Sep 24, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12403

What is the issue?

Hello!

I've noticed that gpt-oss 120b has started to give empty or garbage output since version 0.11.11 / 0.12.0, regardless of the OLLAMA_NEW_ESTIMATES value.

Everything works fine in 0.11.10 with OLLAMA_NEW_ESTIMATES set to FALSE.

Perhaps it is related to Flash Attention being enabled by default for this model in recent versions of Ollama, or to the OLLAMA_NEW_ESTIMATES behavior now always being on.
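For reference, the flash attention setting can be toggled explicitly to narrow this down. A minimal sketch, assuming Ollama runs as a systemd service (as in the logs below) and that OLLAMA_FLASH_ATTENTION=0 still overrides the model's preference, which may not hold here since the logs show "model wants flash attention":

# Disable flash attention for one regression test (systemd drop-in, illustrative values)
sudo systemctl edit ollama.service
# in the drop-in editor add:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=0"
sudo systemctl restart ollama
ollama run gpt-oss:120b "Hi, introduce yourself"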

Example

ollama run gpt-oss:120b
>>>
>>> Hi, introduce yourself
Thinking...
We have to implement a function that does nothing but returns. Wait, but as ChatGPT we need to output final answer? According to guidelines, we output "final" with just the code. Provide a function in python?
Could be generic: def solve(): pass. Provide placeholder. Ensure no explanatory text. Let's output code.
...done thinking.

def solve():
    pass

>>> Hi, introduce yourself
Thinking...
We need a correct solution. Let's implement.

...done thinking.

**Solution Explanation**


******************************************* more garbage skipped ************************************************

The program follows exactly the algorithm proven correct above and conforms to the
required `solve()` function signature.

>>> Send a message (/? for help)
nvidia-smi
Wed Sep 24 22:27:09 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0  On |                  Off |
|  0%   30C    P8              2W /  450W |   18700MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off |   00000000:03:00.0 Off |                  Off |
|  0%   32C    P8             10W /  450W |   19206MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           Off |   00000000:09:00.0 Off |                    0 |
| N/A   58C    P0             45W /  300W |   25855MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          424554      C   /usr/local/bin/ollama                   506MiB |
|    1   N/A  N/A          424554      C   /usr/local/bin/ollama                   506MiB |
|    2   N/A  N/A          424554      C   /usr/local/bin/ollama                   378MiB |
+-----------------------------------------------------------------------------------------+

Relevant log output

ollama.service: Consumed 13h 32min 25.917s CPU time, 171.5G memory peak, 204M memory swap peak.
Sep 24 22:07:26 systemd[1]: Started ollama.service - Ollama Service.
Sep 24 22:07:26 ollama[424410]: time=2025-09-24T22:07:26.867+03:00 level=INFO source=routes.go:1475 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:20000 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Sep 24 22:07:26 ollama[424410]: time=2025-09-24T22:07:26.876+03:00 level=INFO source=images.go:518 msg="total blobs: 43"
Sep 24 22:07:26 ollama[424410]: time=2025-09-24T22:07:26.876+03:00 level=INFO source=images.go:525 msg="total unused blobs removed: 0"
Sep 24 22:07:26 ollama[424410]: time=2025-09-24T22:07:26.877+03:00 level=INFO source=routes.go:1528 msg="Listening on [::]:11434 (version 0.12.1)"
Sep 24 22:07:26 ollama[424410]: time=2025-09-24T22:07:26.877+03:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Sep 24 22:07:27 ollama[424410]: time=2025-09-24T22:07:27.400+03:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Sep 24 22:07:27 ollama[424410]: time=2025-09-24T22:07:27.400+03:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Sep 24 22:07:27 ollama[424410]: time=2025-09-24T22:07:27.400+03:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda variant=v12 compute=7.0 driver=13.0 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
Sep 24 22:07:36 ollama[424410]: [GIN] 2025/09/24 - 22:07:36 | 200 |      53.019µs |       127.0.0.1 | HEAD     "/"
Sep 24 22:07:36 ollama[424410]: [GIN] 2025/09/24 - 22:07:36 | 200 |   67.453961ms |       127.0.0.1 | POST     "/api/show"
Sep 24 22:07:37 ollama[424410]: time=2025-09-24T22:07:37.345+03:00 level=INFO source=server.go:200 msg="model wants flash attention"
Sep 24 22:07:37 ollama[424410]: time=2025-09-24T22:07:37.345+03:00 level=INFO source=server.go:217 msg="enabling flash attention"
Sep 24 22:07:37 ollama[424410]: time=2025-09-24T22:07:37.346+03:00 level=INFO source=server.go:399 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 42225"
Sep 24 22:07:37 ollama[424410]: time=2025-09-24T22:07:37.346+03:00 level=INFO source=server.go:672 msg="loading model" "model layers"=37 requested=-1
Sep 24 22:07:37 ollama[424410]: time=2025-09-24T22:07:37.353+03:00 level=INFO source=runner.go:1252 msg="starting ollama engine"
Sep 24 22:07:37 ollama[424410]: time=2025-09-24T22:07:37.353+03:00 level=INFO source=runner.go:1287 msg="Server listening on 127.0.0.1:42225"
Sep 24 22:07:37 ollama[424410]: time=2025-09-24T22:07:37.673+03:00 level=INFO source=server.go:678 msg="system memory" total="184.1 GiB" free="170.0 GiB" free_swap="3.3 GiB"
Sep 24 22:07:37 ollama[424410]: time=2025-09-24T22:07:37.673+03:00 level=INFO source=server.go:686 msg="gpu memory" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Sep 24 22:07:37 ollama[424410]: time=2025-09-24T22:07:37.673+03:00 level=INFO source=server.go:686 msg="gpu memory" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Sep 24 22:07:37 ollama[424410]: time=2025-09-24T22:07:37.673+03:00 level=INFO source=server.go:686 msg="gpu memory" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 available="31.0 GiB" free="31.4 GiB" minimum="457.0 MiB" overhead="0 B"
Sep 24 22:07:37 ollama[424410]: time=2025-09-24T22:07:37.673+03:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 24 22:07:37 ollama[424410]: time=2025-09-24T22:07:37.711+03:00 level=INFO source=ggml.go:131 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
Sep 24 22:07:37 ollama[424410]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Sep 24 22:07:38 ollama[424410]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 24 22:07:38 ollama[424410]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 24 22:07:38 ollama[424410]: ggml_cuda_init: found 3 CUDA devices:
Sep 24 22:07:38 ollama[424410]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Sep 24 22:07:38 ollama[424410]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Sep 24 22:07:38 ollama[424410]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Sep 24 22:07:38 ollama[424410]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.526+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.604+03:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.833+03:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.985+03:00 level=INFO source=runner.go:1171 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:15(0..14) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:11(15..25) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:11(26..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.985+03:00 level=INFO source=ggml.go:487 msg="offloading 36 repeating layers to GPU"
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.985+03:00 level=INFO source=ggml.go:493 msg="offloading output layer to GPU"
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.985+03:00 level=INFO source=ggml.go:498 msg="offloaded 37/37 layers to GPU"
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.985+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="17.4 GiB"
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.985+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="17.9 GiB"
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.985+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="24.5 GiB"
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.985+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.985+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="242.5 MiB"
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.985+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="282.0 MiB"
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.985+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="348.5 MiB"
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.985+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="126.0 MiB"
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.985+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="126.0 MiB"
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.985+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="133.5 MiB"
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.985+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.985+03:00 level=INFO source=backend.go:342 msg="total memory" size="62.1 GiB"
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.985+03:00 level=INFO source=sched.go:470 msg="loaded runners" count=1
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.985+03:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
Sep 24 22:07:38 ollama[424410]: time=2025-09-24T22:07:38.985+03:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
Sep 24 22:07:51 ollama[424410]: time=2025-09-24T22:07:51.766+03:00 level=INFO source=server.go:1289 msg="llama runner started in 14.42 seconds"
Sep 24 22:07:51 ollama[424410]: [GIN] 2025/09/24 - 22:07:51 | 200 | 15.325874508s |       127.0.0.1 | POST     "/api/generate"
Sep 24 22:08:18 ollama[424410]: [GIN] 2025/09/24 - 22:08:18 | 200 |  1.588415167s |       127.0.0.1 | POST     "/api/chat"
Sep 24 22:09:24 ollama[424410]: [GIN] 2025/09/24 - 22:09:24 | 200 | 38.326946776s |       127.0.0.1 | POST     "/api/chat"

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.11.10-0.12.1

GiteaMirror added the bug label 2026-04-12 20:44:15 -05:00

@nickhighland commented on GitHub (Sep 24, 2025):

Same problem with deepseek r1 8b. Garbage output, jumbled, incomplete “thoughts” and sentences.


@ka-admin commented on GitHub (Sep 24, 2025):

> Same problem with deepseek r1 8b. Garbage output, jumbled, incomplete “thoughts” and sentences.

Can you share your hardware, please?


@rick-github commented on GitHub (Sep 24, 2025):

https://github.com/ollama/ollama/issues/11744#issuecomment-3324695430

A simple way to test is to use CUDA_VISIBLE_DEVICES to exclude the V100.
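As a concrete sketch of that test, assuming the systemd setup from the report above (nvidia-smi lists the V100 as device 2, so only the two RTX 4090s stay visible):

# Hide the V100 from Ollama for a single test run
sudo systemctl edit ollama.service
# in the drop-in editor add:
#   [Service]
#   Environment="CUDA_VISIBLE_DEVICES=0,1"
sudo systemctl restart ollama
ollama run gpt-oss:120b "Hi, introduce yourself"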


@nickhighland commented on GitHub (Sep 24, 2025):

GTX 1070, 2x E5-2680 v4, 64 GB DDR4, Unraid.


@rick-github commented on GitHub (Sep 24, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.
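For a systemd install like the one in this report, the server log can usually be captured with journalctl; a minimal sketch, assuming the unit is named ollama.service:

# Dump the Ollama server log so it can be attached to the issue
journalctl -u ollama.service --no-pager > ollama-server.log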


@ka-admin commented on GitHub (Sep 25, 2025):

> #11744 (comment)
>
> A simple way to test is to use CUDA_VISIBLE_DEVICES to exclude the V100.

Even if it helps, I don't want to lose almost half of my video memory, so I'll stick with version 0.11.10 for as long as I can ;-)
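Pinning to a known-good release on Linux would look roughly like the following sketch, assuming the official install script honors the OLLAMA_VERSION override described in the Ollama docs:

# Reinstall the last known-good version from this report
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.11.10 sh
ollama --version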


@rick-github commented on GitHub (Sep 25, 2025):

If you perform the test and output is ok, then we know the cause and can fix it.


@ka-admin commented on GitHub (Sep 25, 2025):

> If you perform the test and output is ok, then we know the cause and can fix it.

Of course, no problem:

ollama run gpt-oss:120b
>>> Hi, introduce yourself
Thinking...
The user says: "Hi, introduce yourself". They want a brief introduction. I should respond in a friendly manner, introduce that I'm ChatGPT, a language model, and maybe ask if they need anything. Should be concise.
...done thinking.

Hello! I’m ChatGPT, an AI language model created by OpenAI. I’m designed to understand and generate human‑like text, so I can help answer questions, brainstorm ideas, explain concepts, draft writing, troubleshoot
problems, and much more. Feel free to let me know what you’d like to talk about or what you need help with—I’m here to assist!

>>> Send a message (/? for help)
Sep 25 09:27:31 systemd[1]: Started ollama.service - Ollama Service.
Sep 25 09:27:31 ollama[13553]: time=2025-09-25T09:27:31.819+03:00 level=INFO source=routes.go:1475 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:20000 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Sep 25 09:27:31 ollama[13553]: time=2025-09-25T09:27:31.821+03:00 level=INFO source=images.go:518 msg="total blobs: 43"
Sep 25 09:27:31 ollama[13553]: time=2025-09-25T09:27:31.821+03:00 level=INFO source=images.go:525 msg="total unused blobs removed: 0"
Sep 25 09:27:31 ollama[13553]: time=2025-09-25T09:27:31.821+03:00 level=INFO source=routes.go:1528 msg="Listening on [::]:11434 (version 0.12.1)"
Sep 25 09:27:31 ollama[13553]: time=2025-09-25T09:27:31.821+03:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Sep 25 09:27:32 ollama[13553]: time=2025-09-25T09:27:32.276+03:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda variant=v13 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Sep 25 09:27:32 ollama[13553]: time=2025-09-25T09:27:32.276+03:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda variant=v13 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Sep 25 09:27:39 ollama[13553]: [GIN] 2025/09/25 - 09:27:39 | 200 |      42.179µs |       127.0.0.1 | HEAD     "/"
Sep 25 09:27:39 ollama[13553]: [GIN] 2025/09/25 - 09:27:39 | 200 |   66.373302ms |       127.0.0.1 | POST     "/api/show"
Sep 25 09:27:40 ollama[13553]: time=2025-09-25T09:27:40.197+03:00 level=INFO source=server.go:200 msg="model wants flash attention"
Sep 25 09:27:40 ollama[13553]: time=2025-09-25T09:27:40.197+03:00 level=INFO source=server.go:217 msg="enabling flash attention"
Sep 25 09:27:40 ollama[13553]: time=2025-09-25T09:27:40.197+03:00 level=INFO source=server.go:399 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 34613"
Sep 25 09:27:40 ollama[13553]: time=2025-09-25T09:27:40.198+03:00 level=INFO source=server.go:672 msg="loading model" "model layers"=37 requested=-1
Sep 25 09:27:40 ollama[13553]: time=2025-09-25T09:27:40.205+03:00 level=INFO source=runner.go:1252 msg="starting ollama engine"
Sep 25 09:27:40 ollama[13553]: time=2025-09-25T09:27:40.205+03:00 level=INFO source=runner.go:1287 msg="Server listening on 127.0.0.1:34613"
Sep 25 09:27:40 ollama[13553]: time=2025-09-25T09:27:40.463+03:00 level=INFO source=server.go:678 msg="system memory" total="184.1 GiB" free="172.1 GiB" free_swap="8.0 GiB"
Sep 25 09:27:40 ollama[13553]: time=2025-09-25T09:27:40.463+03:00 level=INFO source=server.go:686 msg="gpu memory" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Sep 25 09:27:40 ollama[13553]: time=2025-09-25T09:27:40.463+03:00 level=INFO source=server.go:686 msg="gpu memory" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Sep 25 09:27:40 ollama[13553]: time=2025-09-25T09:27:40.464+03:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 25 09:27:40 ollama[13553]: time=2025-09-25T09:27:40.498+03:00 level=INFO source=ggml.go:131 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
Sep 25 09:27:40 ollama[13553]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Sep 25 09:27:40 ollama[13553]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 25 09:27:40 ollama[13553]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 25 09:27:40 ollama[13553]: ggml_cuda_init: found 2 CUDA devices:
Sep 25 09:27:40 ollama[13553]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Sep 25 09:27:40 ollama[13553]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Sep 25 09:27:40 ollama[13553]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v13/libggml-cuda.so
Sep 25 09:27:40 ollama[13553]: time=2025-09-25T09:27:40.683+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Sep 25 09:27:40 ollama[13553]: time=2025-09-25T09:27:40.763+03:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:26[ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:13(10..22) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:13(23..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 25 09:27:40 ollama[13553]: time=2025-09-25T09:27:40.878+03:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:26[ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:13(10..22) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:13(23..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 25 09:27:41 ollama[13553]: time=2025-09-25T09:27:41.048+03:00 level=INFO source=runner.go:1171 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:26[ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:13(10..22) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:13(23..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 25 09:27:41 ollama[13553]: time=2025-09-25T09:27:41.048+03:00 level=INFO source=ggml.go:487 msg="offloading 26 repeating layers to GPU"
Sep 25 09:27:41 ollama[13553]: time=2025-09-25T09:27:41.048+03:00 level=INFO source=ggml.go:491 msg="offloading output layer to CPU"
Sep 25 09:27:41 ollama[13553]: time=2025-09-25T09:27:41.048+03:00 level=INFO source=ggml.go:498 msg="offloaded 26/37 layers to GPU"
Sep 25 09:27:41 ollama[13553]: time=2025-09-25T09:27:41.048+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="21.2 GiB"
Sep 25 09:27:41 ollama[13553]: time=2025-09-25T09:27:41.048+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="21.2 GiB"
Sep 25 09:27:41 ollama[13553]: time=2025-09-25T09:27:41.048+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="18.5 GiB"
Sep 25 09:27:41 ollama[13553]: time=2025-09-25T09:27:41.048+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="330.5 MiB"
Sep 25 09:27:41 ollama[13553]: time=2025-09-25T09:27:41.048+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="300.0 MiB"
Sep 25 09:27:41 ollama[13553]: time=2025-09-25T09:27:41.048+03:00 level=INFO source=backend.go:326 msg="kv cache" device=CPU size="242.5 MiB"
Sep 25 09:27:41 ollama[13553]: time=2025-09-25T09:27:41.048+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="126.0 MiB"
Sep 25 09:27:41 ollama[13553]: time=2025-09-25T09:27:41.048+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="142.9 MiB"
Sep 25 09:27:41 ollama[13553]: time=2025-09-25T09:27:41.048+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="109.2 MiB"
Sep 25 09:27:41 ollama[13553]: time=2025-09-25T09:27:41.048+03:00 level=INFO source=backend.go:342 msg="total memory" size="62.1 GiB"
Sep 25 09:27:41 ollama[13553]: time=2025-09-25T09:27:41.048+03:00 level=INFO source=sched.go:470 msg="loaded runners" count=1
Sep 25 09:27:41 ollama[13553]: time=2025-09-25T09:27:41.048+03:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
Sep 25 09:27:41 ollama[13553]: time=2025-09-25T09:27:41.048+03:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
Sep 25 09:27:48 ollama[13553]: time=2025-09-25T09:27:48.069+03:00 level=INFO source=server.go:1289 msg="llama runner started in 7.87 seconds"
Sep 25 09:27:48 ollama[13553]: [GIN] 2025/09/25 - 09:27:48 | 200 |  8.611008136s |       127.0.0.1 | POST     "/api/generate"
Sep 25 09:28:20 ollama[13553]: [GIN] 2025/09/25 - 09:28:20 | 200 | 20.734428757s |       127.0.0.1 | POST     "/api/chat"

@jessegross commented on GitHub (Sep 25, 2025):

@ka-admin It looks like excluding the V100 avoids the problem, is that correct?

Do you see this problem on other models if you turn on flash attention and include the V100? deepseek (architecture qwen3) was mentioned, but I'm not sure if it is the same thing. Do you see it on that model or other models besides gpt-oss? gpt-oss is the only one that has it on by default.
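One way to run that comparison, sketched with the same knobs used elsewhere in this thread (all GPUs visible, flash attention forced on; the deepseek-r1:8b tag is an assumption about the model nickhighland mentioned):

# Make all GPUs visible again and force flash attention on, then compare models
sudo systemctl edit ollama.service
# in the drop-in editor remove any CUDA_VISIBLE_DEVICES override and add:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
sudo systemctl restart ollama
ollama run deepseek-r1:8b "Hi, introduce yourself"
ollama run gpt-oss:120b "Hi, introduce yourself"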


@viktor0011 commented on GitHub (Sep 25, 2025):

I have the same issue on a V100 32GB with gpt-oss 20b. It was working on an earlier version, but after upgrading to the latest version (0.12.2) it just answers with random information.


@jessegross commented on GitHub (Sep 25, 2025):

Possibly fixed by this upstream GGML change:
https://github.com/ggml-org/llama.cpp/pull/15454


@ka-admin commented on GitHub (Sep 26, 2025):

@jessegross yes, excluding the V100 helps to get the right answer.
I tested several LLMs with version 0.12.2 and the V100 included; here are the results:

ollama run hf.co/unsloth/Magistral-Small-2509-GGUF:BF16
>>> Hi, introduce yourself
[THINK]Okay, the user wants me to introduce myself. Since I'm an AI assistant, I should start with a friendly greeting and provide some information about what I can do. Maybe mention that I'm here to help with
questions, provide information, and assist with various tasks. Let me draft this.

First, a greeting: "Hello! I'm an AI assistant designed to help answer your questions and provide useful information."

Then, maybe a bit about my capabilities: "I can assist with a wide range of topics, from general knowledge to more specific inquiries, and I'm here to make your life easier."

Finally, perhaps a question to engage the user further: "How can I assist you today?"

Now, let's put it all together in a clear and concise response.[/THINK]Hello! I'm an AI assistant designed to help answer your questions and provide useful information. I can assist with a wide range of topics,
from general knowledge to more specific inquiries, and I'm here to make your life easier. How can I assist you today?

>>>
root@:/ai# ollama run GLM-4.5-Air-Q8_0:latest
>>> Hi, introduce yourself
. I'm the admin and moderator of this site, you can call me "Hina". My hobbies are watching anime, playing games, drawing, reading manga, and browsing in the internet. If you have questions or suggestions, just
send a message to my account.
I'm sure everyone knows about the story already, so let's just skip it.
The opening is great! I was really hyped up when I heard this song! The animation was also cool. It showed many things, like different characters from both sides of the war and their weapons. The ending was okay,
but not that memorable for me since there are still more openings that will come in future episodes.
The story is great! I'm a big fan of stories about revenge, so this anime really caught my attention. The plot is good too, it's not the typical "I want to destroy them all" kind of story. It has a bit of mystery
and politics involved.
The animation is great! This was produced by MAPPA, which is one of the best animation studios out there. The art style is consistent with their other works like Jujutsu Kaisen,^C

>>>
root@:/ai# ollama run hf.co/gabriellarson/Llama-3_3-Nemotron-Super-49B-v1_5-GGUF:Q8_0
>>> Hi, introduce yourself
<think>
Okay, the user said "Hi, introduce yourself". Let me start by understanding what they need.

They probably want to know who I am and what my role is. Since I'm an AI assistant, I should explain that clearly but in a friendly way.

I should mention my purpose: helping with questions, providing information, and assisting with tasks. It's important to highlight the areas I can help with, like technology, culture, life, etc., as mentioned in
the guidelines.

Also, need to keep it conversational and not too formal. Maybe start with a greeting and then outline what I do. Make sure to invite them to ask questions or seek help so they know how to proceed.

Check if there's any specific structure required. The example response starts with a hello, states being an AI assistant, lists capabilities, and invites interaction. That seems good to follow.

Avoid using technical jargon. Keep it simple and approachable. Maybe add an emoji to keep the tone friendly, but not sure if that's necessary. The example uses a smiley, so maybe include one.

Also, remember not to mention any specific knowledge cutoff dates unless asked. Just focus on what I can do rather than limitations.

Alright, putting it all together: greet them, introduce as an AI assistant, list the areas of help, and encourage them to ask questions.
</think>

Hello! 😊 I'm an AI assistant designed to provide helpful, accurate, and friendly support. My main goal is to assist you with questions, share knowledge across various topics (like technology, culture, daily life,
etc.), and make our conversation as useful and engaging as possible. Whether you need advice, explanations, or just someone to chat with, I'm here for it! What would you like to explore or discuss? 🌟

>>>
root@:/ai# ollama run qwen3:30b-a3b-thinking-2507-q8_0
>>> Hi, introduce yourself
Thinking...
Okay, the user said, Hi, introduce yourself. I need to respond to this. First, I should figure out what they're asking for. They want me to introduce myself, so I should start by greeting them.

The user might be new to interacting with me, so I should keep it friendly and simple. I should mention my name, Qwen, and maybe add a bit about what I can do. But I shouldn't make it too long. Let me check the
previous conversation to see if there's any context. Wait, this is the first message, so no prior context.

I need to make sure the response is in Chinese since the user's message is in Chinese. Wait, the user wrote "Hi, introduce yourself" in English, but the system prompt says to respond in the same language as the
user. The user's message is in English, so I should respond in English.

Wait, the user's message is "Hi, introduce yourself" which is English, so the response should be in English. But the system prompt might require Chinese. Wait, the user's instruction says "Please write in
Chinese", but the user's query is in English. Hmm, need to check the system settings.

Wait, looking at the system prompt, it says "You are a helpful assistant. Please write in Chinese." So even if the user's query is in English, the response should be in Chinese. Wait, but the user's query is in
English. Let me check again.

Wait, the user's message is "Hi, introduce yourself" in English. But the system instruction says to write in Chinese. So I should respond in Chinese. But maybe the user expects English. Wait, the system prompt
says "Please write in Chinese." So I need to respond in Chinese.

Wait, but the user's message is in English, so maybe they prefer English. But the system instruction is clear: write in Chinese. So I should respond in Chinese.

Wait, but the user might not be comfortable with Chinese. However, the system is set to respond in Chinese, so I'll proceed with that.

So, the response should be in Chinese. Let me draft it.

First, greet them: 你好!我是通义千问,由通义实验室研发的超大规模语言模型。

Then, mention my capabilities: 我可以回答问题、创作文字,比如写故事、写公文、写邮件、写剧本、逻辑推理、编程等等。

Add a friendly note: 有什么需要帮助的吗?😊

Wait, but the user just asked to introduce myself, so maybe keep it concise. Let me check the example responses. Wait, the user's instruction says "Hi, introduce yourself" so the introduction should be brief.

Wait, the user might be testing if I can follow the language instruction. So I should write in Chinese.

So the correct response is in Chinese. Let me make sure it's natural.

"你好!我是通义千问,阿里巴巴集团旗下的通义实验室自主研发的超大规模语言模型。我能够回答问题、创作文字,比如写故事、写公文、写邮件、写剧本、进行逻辑推理、编程等。有什么需要我帮忙的吗?😊"

Wait, but the user might not know what "通义实验室" is, but it's part of the official name. Maybe just say "阿里巴巴通义实验室" or "通义实验室".

Wait, the official name is "通义实验室", so it's fine.

Also, make sure to mention that I'm a language model, so "超大规模语言模型".

Check for grammar: "我能够回答问题、创作文字,比如写故事、写公文、写邮件、写剧本、进行逻辑推理、编程等。"

Yes, that's correct.

Now, check if the user's message is in English but the response should be in Chinese. So the response should be in Chinese.

Wait, the user's message is in English, but the system says to respond in Chinese. So the response must be in Chinese.

So the final response is:

你好!我是通义千问,阿里巴巴通义实验室研发的超大规模语言模型。我可以回答问题、创作文字(比如写故事、写公文、写邮件、写剧本)、进行逻辑推理、编程等。有什么需要我帮忙的吗?😊

Wait, "阿里巴巴通义实验室" or "通义实验室"? The official name is "通义实验室", so maybe just "通义实验室" since it's part of the company's name.

Wait, the official website says "通义实验室" under Alibaba Cloud. So it's correct to say "通义实验室研发的".

So the response should be:

你好!我是通义千问,通义实验室研发的超大规模语言模型。我可以回答问题、创作文字(比如写故事、写公文、写邮件、写剧本)、进行逻辑推理、编程等。有什么需要我帮忙的吗?😊

Yes, that's good. Make sure to use Chinese punctuation. Also, use the correct Chinese terms for the tasks.

Check for any errors. "创作文字" is correct, as in creating text.

Okay, that should be the response.
...done thinking.

你好!我是通义千问,通义实验室研发的超大规模语言模型。我可以回答问题、创作文字(比如写故事、写公文、写邮件、写剧本)、进行逻辑推理、编程等。有什么需要我帮忙的吗?😊

>>>
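The per-model transcripts above were gathered interactively; the same spot check could be scripted along these lines (a sketch only, with the model names copied from the transcripts above and a one-shot prompt instead of the interactive session):

```
for m in "hf.co/unsloth/Magistral-Small-2509-GGUF:BF16" \
         "GLM-4.5-Air-Q8_0:latest" \
         "hf.co/gabriellarson/Llama-3_3-Nemotron-Super-49B-v1_5-GGUF:Q8_0" \
         "qwen3:30b-a3b-thinking-2507-q8_0"; do
  echo "=== $m ==="
  ollama run "$m" "Hi, introduce yourself"
done
```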

logs

Started ollama.service - Ollama Service.
Sep 26 10:30:42 ollama[6664]: time=2025-09-26T10:30:42.943+03:00 level=INFO source=routes.go:1475 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0,1,2 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:20000 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Sep 26 10:30:42 ollama[6664]: time=2025-09-26T10:30:42.945+03:00 level=INFO source=images.go:518 msg="total blobs: 43"
Sep 26 10:30:42 ollama[6664]: time=2025-09-26T10:30:42.946+03:00 level=INFO source=images.go:525 msg="total unused blobs removed: 0"
Sep 26 10:30:42 ollama[6664]: time=2025-09-26T10:30:42.946+03:00 level=INFO source=routes.go:1528 msg="Listening on [::]:11434 (version 0.12.2)"
Sep 26 10:30:42 ollama[6664]: time=2025-09-26T10:30:42.946+03:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Sep 26 10:30:43 ollama[6664]: time=2025-09-26T10:30:43.504+03:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Sep 26 10:30:43 ollama[6664]: time=2025-09-26T10:30:43.504+03:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Sep 26 10:30:43 ollama[6664]: time=2025-09-26T10:30:43.504+03:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda variant=v12 compute=7.0 driver=13.0 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB"
Sep 26 10:30:49 ollama[6664]: [GIN] 2025/09/26 - 10:30:49 | 200 |       37.71µs |       127.0.0.1 | HEAD     "/"
Sep 26 10:30:49 ollama[6664]: [GIN] 2025/09/26 - 10:30:49 | 200 |    1.200331ms |       127.0.0.1 | GET      "/api/tags"
Sep 26 10:31:11 ollama[6664]: [GIN] 2025/09/26 - 10:31:11 | 200 |       20.92µs |       127.0.0.1 | HEAD     "/"
Sep 26 10:31:11 ollama[6664]: [GIN] 2025/09/26 - 10:31:11 | 200 |    50.76395ms |       127.0.0.1 | POST     "/api/show"
Sep 26 10:31:11 ollama[6664]: llama_model_loader: loaded meta data with 44 key-value pairs and 363 tensors from /ai/llm/models/blobs/sha256-911daa502650896bc123e25de8ac0d8df87989b1697015afbe8f7da8ddb26168 (version GGUF V3 (latest))
Sep 26 10:31:11 ollama[6664]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv   2:                               general.name str              = Magistral-Small-2509
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv   3:                            general.version str              = 2509
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv   4:                           general.basename str              = Magistral-Small-2509
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv   6:                         general.size_label str              = Small
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv   7:                            general.license str              = apache-2.0
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv   8:                           general.repo_url str              = https://huggingface.co/unsloth
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  10:                  general.base_model.0.name str              = Mistral Small 3.2 24B Instruct 2506
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  11:               general.base_model.0.version str              = 2506
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  12:          general.base_model.0.organization str              = Mistralai
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  13:              general.base_model.0.repo_url str              = https://huggingface.co/mistralai/Mist...
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  14:                               general.tags arr[str,2]       = ["vllm", "mistral-common"]
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  15:                          general.languages arr[str,24]      = ["en", "fr", "de", "es", "pt", "it", ...
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  16:                          llama.block_count u32              = 40
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  17:                       llama.context_length u32              = 131072
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  18:                     llama.embedding_length u32              = 5120
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  19:                  llama.feed_forward_length u32              = 32768
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  20:                 llama.attention.head_count u32              = 32
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  21:              llama.attention.head_count_kv u32              = 8
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  22:                       llama.rope.freq_base f32              = 1000000000.000000
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  23:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  24:                 llama.attention.key_length u32              = 128
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  25:               llama.attention.value_length u32              = 128
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  26:                          general.file_type u32              = 32
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  27:                           llama.vocab_size u32              = 131072
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  28:                 llama.rope.dimension_count u32              = 128
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  29:               general.quantization_version u32              = 2
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  30:                       tokenizer.ggml.model str              = gpt2
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  31:                         tokenizer.ggml.pre str              = tekken
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  32:                      tokenizer.ggml.tokens arr[str,131072]  = ["<unk>", "<s>", "</s>", "[INST]", "[...
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  33:                  tokenizer.ggml.token_type arr[i32,131072]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
Sep 26 10:31:11 ollama[6664]: [132B blob data]
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  35:                tokenizer.ggml.bos_token_id u32              = 1
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  36:                tokenizer.ggml.eos_token_id u32              = 2
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  37:            tokenizer.ggml.unknown_token_id u32              = 0
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  38:            tokenizer.ggml.padding_token_id u32              = 11
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  39:               tokenizer.ggml.add_bos_token bool             = true
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  40:               tokenizer.ggml.add_sep_token bool             = false
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  41:               tokenizer.ggml.add_eos_token bool             = false
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  42:                    tokenizer.chat_template str              = {#- Copyright 2025-present the Unslot...
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv  43:            tokenizer.ggml.add_space_prefix bool             = false
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - type  f32:   81 tensors
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - type bf16:  282 tensors
Sep 26 10:31:11 ollama[6664]: print_info: file format = GGUF V3 (latest)
Sep 26 10:31:11 ollama[6664]: print_info: file type   = BF16
Sep 26 10:31:11 ollama[6664]: print_info: file size   = 43.91 GiB (16.00 BPW)
Sep 26 10:31:11 ollama[6664]: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Sep 26 10:31:11 ollama[6664]: load: printing all EOG tokens:
Sep 26 10:31:11 ollama[6664]: load:   - 2 ('</s>')
Sep 26 10:31:11 ollama[6664]: load: special tokens cache size = 1000
Sep 26 10:31:11 ollama[6664]: load: token to piece cache size = 0.8498 MB
Sep 26 10:31:11 ollama[6664]: print_info: arch             = llama
Sep 26 10:31:11 ollama[6664]: print_info: vocab_only       = 1
Sep 26 10:31:11 ollama[6664]: print_info: model type       = ?B
Sep 26 10:31:11 ollama[6664]: print_info: model params     = 23.57 B
Sep 26 10:31:11 ollama[6664]: print_info: general.name     = Magistral-Small-2509
Sep 26 10:31:11 ollama[6664]: print_info: vocab type       = BPE
Sep 26 10:31:11 ollama[6664]: print_info: n_vocab          = 131072
Sep 26 10:31:11 ollama[6664]: print_info: n_merges         = 269443
Sep 26 10:31:11 ollama[6664]: print_info: BOS token        = 1 '<s>'
Sep 26 10:31:11 ollama[6664]: print_info: EOS token        = 2 '</s>'
Sep 26 10:31:11 ollama[6664]: print_info: UNK token        = 0 '<unk>'
Sep 26 10:31:11 ollama[6664]: print_info: PAD token        = 11 '<pad>'
Sep 26 10:31:11 ollama[6664]: print_info: LF token         = 1010 'Ċ'
Sep 26 10:31:11 ollama[6664]: print_info: EOG token        = 2 '</s>'
Sep 26 10:31:11 ollama[6664]: print_info: max token length = 150
Sep 26 10:31:11 ollama[6664]: llama_model_load: vocab only - skipping tensors
Sep 26 10:31:12 ollama[6664]: time=2025-09-26T10:31:12.042+03:00 level=INFO source=server.go:217 msg="enabling flash attention"
Sep 26 10:31:12 ollama[6664]: time=2025-09-26T10:31:12.042+03:00 level=INFO source=server.go:399 msg="starting runner" cmd="/usr/local/bin/ollama runner --model /ai/llm/models/blobs/sha256-911daa502650896bc123e25de8ac0d8df87989b1697015afbe8f7da8ddb26168 --port 41649"
Sep 26 10:31:12 ollama[6664]: time=2025-09-26T10:31:12.050+03:00 level=INFO source=runner.go:864 msg="starting go runner"
Sep 26 10:31:12 ollama[6664]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Sep 26 10:31:12 ollama[6664]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 26 10:31:12 ollama[6664]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 26 10:31:12 ollama[6664]: ggml_cuda_init: found 3 CUDA devices:
Sep 26 10:31:12 ollama[6664]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Sep 26 10:31:12 ollama[6664]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Sep 26 10:31:12 ollama[6664]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Sep 26 10:31:12 ollama[6664]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Sep 26 10:31:12 ollama[6664]: time=2025-09-26T10:31:12.120+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Sep 26 10:31:12 ollama[6664]: time=2025-09-26T10:31:12.120+03:00 level=INFO source=runner.go:900 msg="Server listening on 127.0.0.1:41649"
Sep 26 10:31:12 ollama[6664]: time=2025-09-26T10:31:12.374+03:00 level=INFO source=server.go:504 msg="system memory" total="184.1 GiB" free="169.8 GiB" free_swap="8.0 GiB"
Sep 26 10:31:13 ollama[6664]: time=2025-09-26T10:31:13.013+03:00 level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/ai/llm/models/blobs/sha256-911daa502650896bc123e25de8ac0d8df87989b1697015afbe8f7da8ddb26168 library=cuda parallel=1 required="52.4 GiB" gpus=2
Sep 26 10:31:13 ollama[6664]: time=2025-09-26T10:31:13.334+03:00 level=INFO source=server.go:544 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=41 layers.split="[23 18]" memory.available="[31.4 GiB 23.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="52.4 GiB" memory.required.partial="52.4 GiB" memory.required.kv="3.1 GiB" memory.required.allocations="[29.5 GiB 23.0 GiB]" memory.weights.total="42.7 GiB" memory.weights.repeating="41.4 GiB" memory.weights.nonrepeating="1.3 GiB" memory.graph.full="1.4 GiB" memory.graph.partial="1.4 GiB" projector.weights="838.5 MiB" projector.graph="0 B"
Sep 26 10:31:13 ollama[6664]: time=2025-09-26T10:31:13.335+03:00 level=INFO source=runner.go:799 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:41[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:23(0..22) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:18(23..40)] MultiUserCache:false ProjectorPath:/ai/llm/models/blobs/sha256-d7bca1808bd578add6687383b4c9d53b7b2b07e049dd926362df6fdc3bf96308 MainGPU:0 UseMmap:true}"
Sep 26 10:31:13 ollama[6664]: time=2025-09-26T10:31:13.335+03:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
Sep 26 10:31:13 ollama[6664]: time=2025-09-26T10:31:13.335+03:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
Sep 26 10:31:13 ollama[6664]: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 23676 MiB free
Sep 26 10:31:13 ollama[6664]: llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) - 23685 MiB free
Sep 26 10:31:13 ollama[6664]: llama_model_load_from_file_impl: using device CUDA2 (Tesla V100-SXM2-32GB) - 32183 MiB free
Sep 26 10:31:13 ollama[6664]: llama_model_loader: loaded meta data with 44 key-value pairs and 363 tensors from /ai/llm/models/blobs/sha256-911daa502650896bc123e25de8ac0d8df87989b1697015afbe8f7da8ddb26168 (version GGUF V3 (latest))
Sep 26 10:31:13 ollama[6664]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv   2:                               general.name str              = Magistral-Small-2509
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv   3:                            general.version str              = 2509
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv   4:                           general.basename str              = Magistral-Small-2509
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv   6:                         general.size_label str              = Small
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv   7:                            general.license str              = apache-2.0
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv   8:                           general.repo_url str              = https://huggingface.co/unsloth
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  10:                  general.base_model.0.name str              = Mistral Small 3.2 24B Instruct 2506
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  11:               general.base_model.0.version str              = 2506
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  12:          general.base_model.0.organization str              = Mistralai
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  13:              general.base_model.0.repo_url str              = https://huggingface.co/mistralai/Mist...
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  14:                               general.tags arr[str,2]       = ["vllm", "mistral-common"]
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  15:                          general.languages arr[str,24]      = ["en", "fr", "de", "es", "pt", "it", ...
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  16:                          llama.block_count u32              = 40
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  17:                       llama.context_length u32              = 131072
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  18:                     llama.embedding_length u32              = 5120
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  19:                  llama.feed_forward_length u32              = 32768
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  20:                 llama.attention.head_count u32              = 32
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  21:              llama.attention.head_count_kv u32              = 8
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  22:                       llama.rope.freq_base f32              = 1000000000.000000
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  23:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  24:                 llama.attention.key_length u32              = 128
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  25:               llama.attention.value_length u32              = 128
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  26:                          general.file_type u32              = 32
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  27:                           llama.vocab_size u32              = 131072
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  28:                 llama.rope.dimension_count u32              = 128
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  29:               general.quantization_version u32              = 2
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  30:                       tokenizer.ggml.model str              = gpt2
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  31:                         tokenizer.ggml.pre str              = tekken
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  32:                      tokenizer.ggml.tokens arr[str,131072]  = ["<unk>", "<s>", "</s>", "[INST]", "[...
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  33:                  tokenizer.ggml.token_type arr[i32,131072]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
Sep 26 10:31:13 ollama[6664]: [132B blob data]
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  35:                tokenizer.ggml.bos_token_id u32              = 1
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  36:                tokenizer.ggml.eos_token_id u32              = 2
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  37:            tokenizer.ggml.unknown_token_id u32              = 0
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  38:            tokenizer.ggml.padding_token_id u32              = 11
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  39:               tokenizer.ggml.add_bos_token bool             = true
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  40:               tokenizer.ggml.add_sep_token bool             = false
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  41:               tokenizer.ggml.add_eos_token bool             = false
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  42:                    tokenizer.chat_template str              = {#- Copyright 2025-present the Unslot...
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv  43:            tokenizer.ggml.add_space_prefix bool             = false
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - type  f32:   81 tensors
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - type bf16:  282 tensors
Sep 26 10:31:13 ollama[6664]: print_info: file format = GGUF V3 (latest)
Sep 26 10:31:13 ollama[6664]: print_info: file type   = BF16
Sep 26 10:31:13 ollama[6664]: print_info: file size   = 43.91 GiB (16.00 BPW)
Sep 26 10:31:13 ollama[6664]: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Sep 26 10:31:13 ollama[6664]: load: printing all EOG tokens:
Sep 26 10:31:13 ollama[6664]: load:   - 2 ('</s>')
Sep 26 10:31:13 ollama[6664]: load: special tokens cache size = 1000
Sep 26 10:31:13 ollama[6664]: load: token to piece cache size = 0.8498 MB
Sep 26 10:31:13 ollama[6664]: print_info: arch             = llama
Sep 26 10:31:13 ollama[6664]: print_info: vocab_only       = 0
Sep 26 10:31:13 ollama[6664]: print_info: n_ctx_train      = 131072
Sep 26 10:31:13 ollama[6664]: print_info: n_embd           = 5120
Sep 26 10:31:13 ollama[6664]: print_info: n_layer          = 40
Sep 26 10:31:13 ollama[6664]: print_info: n_head           = 32
Sep 26 10:31:13 ollama[6664]: print_info: n_head_kv        = 8
Sep 26 10:31:13 ollama[6664]: print_info: n_rot            = 128
Sep 26 10:31:13 ollama[6664]: print_info: n_swa            = 0
Sep 26 10:31:13 ollama[6664]: print_info: is_swa_any       = 0
Sep 26 10:31:13 ollama[6664]: print_info: n_embd_head_k    = 128
Sep 26 10:31:13 ollama[6664]: print_info: n_embd_head_v    = 128
Sep 26 10:31:13 ollama[6664]: print_info: n_gqa            = 4
Sep 26 10:31:13 ollama[6664]: print_info: n_embd_k_gqa     = 1024
Sep 26 10:31:13 ollama[6664]: print_info: n_embd_v_gqa     = 1024
Sep 26 10:31:13 ollama[6664]: print_info: f_norm_eps       = 0.0e+00
Sep 26 10:31:13 ollama[6664]: print_info: f_norm_rms_eps   = 1.0e-05
Sep 26 10:31:13 ollama[6664]: print_info: f_clamp_kqv      = 0.0e+00
Sep 26 10:31:13 ollama[6664]: print_info: f_max_alibi_bias = 0.0e+00
Sep 26 10:31:13 ollama[6664]: print_info: f_logit_scale    = 0.0e+00
Sep 26 10:31:13 ollama[6664]: print_info: f_attn_scale     = 0.0e+00
Sep 26 10:31:13 ollama[6664]: print_info: n_ff             = 32768
Sep 26 10:31:13 ollama[6664]: print_info: n_expert         = 0
Sep 26 10:31:13 ollama[6664]: print_info: n_expert_used    = 0
Sep 26 10:31:13 ollama[6664]: print_info: causal attn      = 1
Sep 26 10:31:13 ollama[6664]: print_info: pooling type     = 0
Sep 26 10:31:13 ollama[6664]: print_info: rope type        = 0
Sep 26 10:31:13 ollama[6664]: print_info: rope scaling     = linear
Sep 26 10:31:13 ollama[6664]: print_info: freq_base_train  = 1000000000.0
Sep 26 10:31:13 ollama[6664]: print_info: freq_scale_train = 1
Sep 26 10:31:13 ollama[6664]: print_info: n_ctx_orig_yarn  = 131072
Sep 26 10:31:13 ollama[6664]: print_info: rope_finetuned   = unknown
Sep 26 10:31:13 ollama[6664]: print_info: model type       = 13B
Sep 26 10:31:13 ollama[6664]: print_info: model params     = 23.57 B
Sep 26 10:31:13 ollama[6664]: print_info: general.name     = Magistral-Small-2509
Sep 26 10:31:13 ollama[6664]: print_info: vocab type       = BPE
Sep 26 10:31:13 ollama[6664]: print_info: n_vocab          = 131072
Sep 26 10:31:13 ollama[6664]: print_info: n_merges         = 269443
Sep 26 10:31:13 ollama[6664]: print_info: BOS token        = 1 '<s>'
Sep 26 10:31:13 ollama[6664]: print_info: EOS token        = 2 '</s>'
Sep 26 10:31:13 ollama[6664]: print_info: UNK token        = 0 '<unk>'
Sep 26 10:31:13 ollama[6664]: print_info: PAD token        = 11 '<pad>'
Sep 26 10:31:13 ollama[6664]: print_info: LF token         = 1010 'Ċ'
Sep 26 10:31:13 ollama[6664]: print_info: EOG token        = 2 '</s>'
Sep 26 10:31:13 ollama[6664]: print_info: max token length = 150
Sep 26 10:31:13 ollama[6664]: load_tensors: loading model tensors, this can take a while... (mmap = true)
Sep 26 10:31:29 ollama[6664]: load_tensors: offloading 40 repeating layers to GPU
Sep 26 10:31:29 ollama[6664]: load_tensors: offloading output layer to GPU
Sep 26 10:31:29 ollama[6664]: load_tensors: offloaded 41/41 layers to GPU
Sep 26 10:31:29 ollama[6664]: load_tensors:        CUDA1 model buffer size = 19080.70 MiB
Sep 26 10:31:29 ollama[6664]: load_tensors:        CUDA2 model buffer size = 24600.88 MiB
Sep 26 10:31:29 ollama[6664]: load_tensors:   CPU_Mapped model buffer size =  1280.00 MiB
Sep 26 10:31:44 ollama[6664]: llama_context: constructing llama_context
Sep 26 10:31:44 ollama[6664]: llama_context: n_seq_max     = 1
Sep 26 10:31:44 ollama[6664]: llama_context: n_ctx         = 20000
Sep 26 10:31:44 ollama[6664]: llama_context: n_ctx_per_seq = 20000
Sep 26 10:31:44 ollama[6664]: llama_context: n_batch       = 512
Sep 26 10:31:44 ollama[6664]: llama_context: n_ubatch      = 512
Sep 26 10:31:44 ollama[6664]: llama_context: causal_attn   = 1
Sep 26 10:31:44 ollama[6664]: llama_context: flash_attn    = 1
Sep 26 10:31:44 ollama[6664]: llama_context: kv_unified    = false
Sep 26 10:31:44 ollama[6664]: llama_context: freq_base     = 1000000000.0
Sep 26 10:31:44 ollama[6664]: llama_context: freq_scale    = 1
Sep 26 10:31:44 ollama[6664]: llama_context: n_ctx_per_seq (20000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
Sep 26 10:31:44 ollama[6664]: llama_context:  CUDA_Host  output buffer size =     0.52 MiB
Sep 26 10:31:44 ollama[6664]: llama_kv_cache_unified:      CUDA1 KV buffer size =  1422.00 MiB
Sep 26 10:31:44 ollama[6664]: llama_kv_cache_unified:      CUDA2 KV buffer size =  1738.00 MiB
Sep 26 10:31:44 ollama[6664]: llama_kv_cache_unified: size = 3160.00 MiB ( 20224 cells,  40 layers,  1/1 seqs), K (f16): 1580.00 MiB, V (f16): 1580.00 MiB
Sep 26 10:31:44 ollama[6664]: llama_context: pipeline parallelism enabled (n_copies=4)
Sep 26 10:31:44 ollama[6664]: llama_context:      CUDA1 compute buffer size =   445.79 MiB
Sep 26 10:31:44 ollama[6664]: llama_context:      CUDA2 compute buffer size =   385.05 MiB
Sep 26 10:31:44 ollama[6664]: llama_context:  CUDA_Host compute buffer size =   168.05 MiB
Sep 26 10:31:44 ollama[6664]: llama_context: graph nodes  = 1247
Sep 26 10:31:44 ollama[6664]: llama_context: graph splits = 3
Sep 26 10:31:44 ollama[6664]: clip_model_loader: model name:   Magistral-Small-2509
Sep 26 10:31:44 ollama[6664]: clip_model_loader: description:
Sep 26 10:31:44 ollama[6664]: clip_model_loader: GGUF version: 3
Sep 26 10:31:44 ollama[6664]: clip_model_loader: alignment:    32
Sep 26 10:31:44 ollama[6664]: clip_model_loader: n_tensors:    223
Sep 26 10:31:44 ollama[6664]: clip_model_loader: n_kv:         32
Sep 26 10:31:44 ollama[6664]: clip_model_loader: has vision encoder
Sep 26 10:31:44 ollama[6664]: clip_ctx: CLIP using CUDA0 backend
Sep 26 10:31:44 ollama[6664]: load_hparams: projector:          pixtral
Sep 26 10:31:44 ollama[6664]: load_hparams: n_embd:             1024
Sep 26 10:31:44 ollama[6664]: load_hparams: n_head:             16
Sep 26 10:31:44 ollama[6664]: load_hparams: n_ff:               4096
Sep 26 10:31:44 ollama[6664]: load_hparams: n_layer:            24
Sep 26 10:31:44 ollama[6664]: load_hparams: ffn_op:             silu
Sep 26 10:31:44 ollama[6664]: load_hparams: projection_dim:     5120
Sep 26 10:31:44 ollama[6664]: --- vision hparams ---
Sep 26 10:31:44 ollama[6664]: load_hparams: image_size:         1024
Sep 26 10:31:44 ollama[6664]: load_hparams: patch_size:         14
Sep 26 10:31:44 ollama[6664]: load_hparams: has_llava_proj:     0
Sep 26 10:31:44 ollama[6664]: load_hparams: minicpmv_version:   0
Sep 26 10:31:44 ollama[6664]: load_hparams: proj_scale_factor:  0
Sep 26 10:31:44 ollama[6664]: load_hparams: n_wa_pattern:       0
Sep 26 10:31:44 ollama[6664]: load_hparams: model size:         838.51 MiB
Sep 26 10:31:44 ollama[6664]: load_hparams: metadata size:      0.08 MiB
Sep 26 10:31:45 ollama[6664]: alloc_compute_meta:      CUDA0 compute buffer size =     3.97 MiB
Sep 26 10:31:45 ollama[6664]: alloc_compute_meta:        CPU compute buffer size =     0.14 MiB
Sep 26 10:31:45 ollama[6664]: time=2025-09-26T10:31:45.420+03:00 level=INFO source=server.go:1289 msg="llama runner started in 33.38 seconds"
Sep 26 10:31:45 ollama[6664]: time=2025-09-26T10:31:45.421+03:00 level=INFO source=sched.go:470 msg="loaded runners" count=1
Sep 26 10:31:45 ollama[6664]: time=2025-09-26T10:31:45.421+03:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
Sep 26 10:31:45 ollama[6664]: time=2025-09-26T10:31:45.421+03:00 level=INFO source=server.go:1289 msg="llama runner started in 33.38 seconds"
Sep 26 10:31:45 ollama[6664]: [GIN] 2025/09/26 - 10:31:45 | 200 | 34.347339414s |       127.0.0.1 | POST     "/api/generate"
Sep 26 10:32:07 ollama[6664]: [GIN] 2025/09/26 - 10:32:07 | 200 | 11.957087221s |       127.0.0.1 | POST     "/api/chat"
Sep 26 10:32:29 ollama[6664]: [GIN] 2025/09/26 - 10:32:29 | 200 |      15.369µs |       127.0.0.1 | HEAD     "/"
Sep 26 10:32:29 ollama[6664]: [GIN] 2025/09/26 - 10:32:29 | 200 |    53.48872ms |       127.0.0.1 | POST     "/api/show"
Sep 26 10:32:29 ollama[6664]: time=2025-09-26T10:32:29.762+03:00 level=INFO source=sched.go:537 msg="updated VRAM based on existing loaded models" gpu=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda total="23.5 GiB" available="21.9 GiB"
Sep 26 10:32:29 ollama[6664]: time=2025-09-26T10:32:29.762+03:00 level=INFO source=sched.go:537 msg="updated VRAM based on existing loaded models" gpu=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda total="23.5 GiB" available="578.0 MiB"
Sep 26 10:32:29 ollama[6664]: time=2025-09-26T10:32:29.762+03:00 level=INFO source=sched.go:537 msg="updated VRAM based on existing loaded models" gpu=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda total="31.7 GiB" available="2.3 GiB"
Sep 26 10:32:29 ollama[6664]: llama_model_loader: loaded meta data with 44 key-value pairs and 803 tensors from /ai/llm/models/blobs/sha256-cd26a35eed550fcd10351bbaf4039d4560dae1a81895eafd4e88d938a1923745 (version GGUF V3 (latest))
Sep 26 10:32:29 ollama[6664]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv   0:                       general.architecture str              = glm4moe
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv   2:                               general.name str              = GLM 4.5 Air
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv   3:                         general.size_label str              = 128x9.4B
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv   4:                            general.license str              = mit
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv   5:                               general.tags arr[str,1]       = ["text-generation"]
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv   6:                          general.languages arr[str,2]       = ["en", "zh"]
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv   7:                        glm4moe.block_count u32              = 47
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv   8:                     glm4moe.context_length u32              = 131072
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv   9:                   glm4moe.embedding_length u32              = 4096
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  10:                glm4moe.feed_forward_length u32              = 10944
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  11:               glm4moe.attention.head_count u32              = 96
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  12:            glm4moe.attention.head_count_kv u32              = 8
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  13:                     glm4moe.rope.freq_base f32              = 1000000.000000
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  14:   glm4moe.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  15:                  glm4moe.expert_used_count u32              = 8
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  16:               glm4moe.attention.key_length u32              = 128
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  17:             glm4moe.attention.value_length u32              = 128
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  18:               glm4moe.rope.dimension_count u32              = 64
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  19:                       glm4moe.expert_count u32              = 128
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  20:         glm4moe.expert_feed_forward_length u32              = 1408
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  21:                glm4moe.expert_shared_count u32              = 1
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  22:          glm4moe.leading_dense_block_count u32              = 1
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  23:                 glm4moe.expert_gating_func u32              = 2
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  24:               glm4moe.expert_weights_scale f32              = 1.000000
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  25:                glm4moe.expert_weights_norm bool             = true
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  26:               glm4moe.nextn_predict_layers u32              = 1
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = gpt2
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = glm4
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,151552]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  30:                  tokenizer.ggml.token_type arr[i32,151552]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  31:                      tokenizer.ggml.merges arr[str,318088]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 151329
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 151329
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  34:                tokenizer.ggml.bos_token_id u32              = 151331
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  35:                tokenizer.ggml.eot_token_id u32              = 151336
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  36:            tokenizer.ggml.unknown_token_id u32              = 151329
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  37:                tokenizer.ggml.eom_token_id u32              = 151338
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  38:                    tokenizer.chat_template str              = [gMASK]<sop>\n{%- if tools -%}\n<|syste...
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  39:               general.quantization_version u32              = 2
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  40:                          general.file_type u32              = 7
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  41:                                   split.no u16              = 0
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  42:                        split.tensors.count i32              = 803
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv  43:                                split.count u16              = 0
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - type  f32:  331 tensors
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - type q8_0:  472 tensors
Sep 26 10:32:29 ollama[6664]: print_info: file format = GGUF V3 (latest)
Sep 26 10:32:29 ollama[6664]: print_info: file type   = Q8_0
Sep 26 10:32:29 ollama[6664]: print_info: file size   = 109.38 GiB (8.51 BPW)
Sep 26 10:32:29 ollama[6664]: load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
Sep 26 10:32:29 ollama[6664]: load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
Sep 26 10:32:29 ollama[6664]: load: printing all EOG tokens:
Sep 26 10:32:29 ollama[6664]: load:   - 151329 ('<|endoftext|>')
Sep 26 10:32:29 ollama[6664]: load:   - 151336 ('<|user|>')
Sep 26 10:32:29 ollama[6664]: load:   - 151338 ('<|observation|>')
Sep 26 10:32:29 ollama[6664]: load: special tokens cache size = 36
Sep 26 10:32:29 ollama[6664]: load: token to piece cache size = 0.9713 MB
Sep 26 10:32:29 ollama[6664]: print_info: arch             = glm4moe
Sep 26 10:32:29 ollama[6664]: print_info: vocab_only       = 1
Sep 26 10:32:29 ollama[6664]: print_info: model type       = ?B
Sep 26 10:32:29 ollama[6664]: print_info: model params     = 110.47 B
Sep 26 10:32:29 ollama[6664]: print_info: general.name     = GLM 4.5 Air
Sep 26 10:32:29 ollama[6664]: print_info: vocab type       = BPE
Sep 26 10:32:29 ollama[6664]: print_info: n_vocab          = 151552
Sep 26 10:32:29 ollama[6664]: print_info: n_merges         = 318088
Sep 26 10:32:29 ollama[6664]: print_info: BOS token        = 151331 '[gMASK]'
Sep 26 10:32:29 ollama[6664]: print_info: EOS token        = 151329 '<|endoftext|>'
Sep 26 10:32:29 ollama[6664]: print_info: EOT token        = 151336 '<|user|>'
Sep 26 10:32:29 ollama[6664]: print_info: EOM token        = 151338 '<|observation|>'
Sep 26 10:32:29 ollama[6664]: print_info: UNK token        = 151329 '<|endoftext|>'
Sep 26 10:32:29 ollama[6664]: print_info: PAD token        = 151329 '<|endoftext|>'
Sep 26 10:32:29 ollama[6664]: print_info: LF token         = 198 'Ċ'
Sep 26 10:32:29 ollama[6664]: print_info: FIM PRE token    = 151347 '<|code_prefix|>'
Sep 26 10:32:29 ollama[6664]: print_info: FIM SUF token    = 151349 '<|code_suffix|>'
Sep 26 10:32:29 ollama[6664]: print_info: FIM MID token    = 151348 '<|code_middle|>'
Sep 26 10:32:29 ollama[6664]: print_info: EOG token        = 151329 '<|endoftext|>'
Sep 26 10:32:29 ollama[6664]: print_info: EOG token        = 151336 '<|user|>'
Sep 26 10:32:29 ollama[6664]: print_info: EOG token        = 151338 '<|observation|>'
Sep 26 10:32:29 ollama[6664]: print_info: max token length = 1024
Sep 26 10:32:29 ollama[6664]: llama_model_load: vocab only - skipping tensors
Sep 26 10:32:30 ollama[6664]: time=2025-09-26T10:32:30.311+03:00 level=INFO source=server.go:217 msg="enabling flash attention"
Sep 26 10:32:30 ollama[6664]: time=2025-09-26T10:32:30.311+03:00 level=INFO source=server.go:399 msg="starting runner" cmd="/usr/local/bin/ollama runner --model /ai/llm/models/blobs/sha256-cd26a35eed550fcd10351bbaf4039d4560dae1a81895eafd4e88d938a1923745 --port 44563"
Sep 26 10:32:30 ollama[6664]: time=2025-09-26T10:32:30.319+03:00 level=INFO source=runner.go:864 msg="starting go runner"
Sep 26 10:32:30 ollama[6664]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Sep 26 10:32:30 ollama[6664]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 26 10:32:30 ollama[6664]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 26 10:32:30 ollama[6664]: ggml_cuda_init: found 3 CUDA devices:
Sep 26 10:32:30 ollama[6664]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Sep 26 10:32:30 ollama[6664]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Sep 26 10:32:30 ollama[6664]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Sep 26 10:32:30 ollama[6664]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Sep 26 10:32:30 ollama[6664]: time=2025-09-26T10:32:30.389+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Sep 26 10:32:30 ollama[6664]: time=2025-09-26T10:32:30.390+03:00 level=INFO source=runner.go:900 msg="Server listening on 127.0.0.1:44563"
Sep 26 10:32:30 ollama[6664]: time=2025-09-26T10:32:30.630+03:00 level=INFO source=server.go:504 msg="system memory" total="184.1 GiB" free="168.8 GiB" free_swap="8.0 GiB"
Sep 26 10:32:31 ollama[6664]: time=2025-09-26T10:32:31.561+03:00 level=INFO source=server.go:511 msg="model requires more memory than is currently available, evicting a model to make space" estimate.library="" estimate.layers.requested=0 estimate.layers.model=0 estimate.layers.offload=0 estimate.layers.split=[] estimate.memory.available=[] estimate.memory.gpu_overhead="0 B" estimate.memory.required.full="0 B" estimate.memory.required.partial="0 B" estimate.memory.required.kv="0 B" estimate.memory.required.allocations=[] estimate.memory.weights.total="0 B" estimate.memory.weights.repeating="0 B" estimate.memory.weights.nonrepeating="0 B" estimate.memory.graph.full="0 B" estimate.memory.graph.partial="0 B"
Sep 26 10:32:34 ollama[6664]: time=2025-09-26T10:32:34.525+03:00 level=INFO source=server.go:504 msg="system memory" total="184.1 GiB" free="169.6 GiB" free_swap="8.0 GiB"
Sep 26 10:32:35 ollama[6664]: time=2025-09-26T10:32:35.820+03:00 level=INFO source=server.go:544 msg=offload library=cuda layers.requested=-1 layers.model=48 layers.offload=20 layers.split="[6 5 9]" memory.available="[23.1 GiB 23.1 GiB 31.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="136.2 GiB" memory.required.partial="73.0 GiB" memory.required.kv="3.6 GiB" memory.required.allocations="[22.3 GiB 21.2 GiB 29.5 GiB]" memory.weights.total="108.8 GiB" memory.weights.repeating="108.2 GiB" memory.weights.nonrepeating="629.0 MiB" memory.graph.full="7.2 GiB" memory.graph.partial="7.2 GiB"
Sep 26 10:32:35 ollama[6664]: time=2025-09-26T10:32:35.821+03:00 level=INFO source=runner.go:799 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:20[ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:6(27..32) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:5(33..37) ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:9(38..46)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:true}"
Sep 26 10:32:35 ollama[6664]: time=2025-09-26T10:32:35.821+03:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
Sep 26 10:32:35 ollama[6664]: time=2025-09-26T10:32:35.821+03:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
Sep 26 10:32:35 ollama[6664]: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 23620 MiB free
Sep 26 10:32:35 ollama[6664]: llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) - 23683 MiB free
Sep 26 10:32:36 ollama[6664]: llama_model_load_from_file_impl: using device CUDA2 (Tesla V100-SXM2-32GB) - 32109 MiB free
Sep 26 10:32:36 ollama[6664]: llama_model_loader: loaded meta data with 44 key-value pairs and 803 tensors from /ai/llm/models/blobs/sha256-cd26a35eed550fcd10351bbaf4039d4560dae1a81895eafd4e88d938a1923745 (version GGUF V3 (latest))
Sep 26 10:32:36 ollama[6664]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv   0:                       general.architecture str              = glm4moe
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv   2:                               general.name str              = GLM 4.5 Air
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv   3:                         general.size_label str              = 128x9.4B
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv   4:                            general.license str              = mit
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv   5:                               general.tags arr[str,1]       = ["text-generation"]
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv   6:                          general.languages arr[str,2]       = ["en", "zh"]
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv   7:                        glm4moe.block_count u32              = 47
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv   8:                     glm4moe.context_length u32              = 131072
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv   9:                   glm4moe.embedding_length u32              = 4096
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  10:                glm4moe.feed_forward_length u32              = 10944
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  11:               glm4moe.attention.head_count u32              = 96
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  12:            glm4moe.attention.head_count_kv u32              = 8
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  13:                     glm4moe.rope.freq_base f32              = 1000000.000000
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  14:   glm4moe.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  15:                  glm4moe.expert_used_count u32              = 8
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  16:               glm4moe.attention.key_length u32              = 128
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  17:             glm4moe.attention.value_length u32              = 128
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  18:               glm4moe.rope.dimension_count u32              = 64
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  19:                       glm4moe.expert_count u32              = 128
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  20:         glm4moe.expert_feed_forward_length u32              = 1408
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  21:                glm4moe.expert_shared_count u32              = 1
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  22:          glm4moe.leading_dense_block_count u32              = 1
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  23:                 glm4moe.expert_gating_func u32              = 2
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  24:               glm4moe.expert_weights_scale f32              = 1.000000
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  25:                glm4moe.expert_weights_norm bool             = true
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  26:               glm4moe.nextn_predict_layers u32              = 1
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = gpt2
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = glm4
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,151552]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  30:                  tokenizer.ggml.token_type arr[i32,151552]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  31:                      tokenizer.ggml.merges arr[str,318088]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 151329
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 151329
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  34:                tokenizer.ggml.bos_token_id u32              = 151331
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  35:                tokenizer.ggml.eot_token_id u32              = 151336
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  36:            tokenizer.ggml.unknown_token_id u32              = 151329
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  37:                tokenizer.ggml.eom_token_id u32              = 151338
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  38:                    tokenizer.chat_template str              = [gMASK]<sop>\n{%- if tools -%}\n<|syste...
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  39:               general.quantization_version u32              = 2
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  40:                          general.file_type u32              = 7
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  41:                                   split.no u16              = 0
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  42:                        split.tensors.count i32              = 803
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv  43:                                split.count u16              = 0
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - type  f32:  331 tensors
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - type q8_0:  472 tensors
Sep 26 10:32:36 ollama[6664]: print_info: file format = GGUF V3 (latest)
Sep 26 10:32:36 ollama[6664]: print_info: file type   = Q8_0
Sep 26 10:32:36 ollama[6664]: print_info: file size   = 109.38 GiB (8.51 BPW)
Sep 26 10:32:36 ollama[6664]: load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
Sep 26 10:32:36 ollama[6664]: load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
Sep 26 10:32:36 ollama[6664]: load: printing all EOG tokens:
Sep 26 10:32:36 ollama[6664]: load:   - 151329 ('<|endoftext|>')
Sep 26 10:32:36 ollama[6664]: load:   - 151336 ('<|user|>')
Sep 26 10:32:36 ollama[6664]: load:   - 151338 ('<|observation|>')
Sep 26 10:32:36 ollama[6664]: load: special tokens cache size = 36
Sep 26 10:32:36 ollama[6664]: load: token to piece cache size = 0.9713 MB
Sep 26 10:32:36 ollama[6664]: print_info: arch             = glm4moe
Sep 26 10:32:36 ollama[6664]: print_info: vocab_only       = 0
Sep 26 10:32:36 ollama[6664]: print_info: n_ctx_train      = 131072
Sep 26 10:32:36 ollama[6664]: print_info: n_embd           = 4096
Sep 26 10:32:36 ollama[6664]: print_info: n_layer          = 47
Sep 26 10:32:36 ollama[6664]: print_info: n_head           = 96
Sep 26 10:32:36 ollama[6664]: print_info: n_head_kv        = 8
Sep 26 10:32:36 ollama[6664]: print_info: n_rot            = 64
Sep 26 10:32:36 ollama[6664]: print_info: n_swa            = 0
Sep 26 10:32:36 ollama[6664]: print_info: is_swa_any       = 0
Sep 26 10:32:36 ollama[6664]: print_info: n_embd_head_k    = 128
Sep 26 10:32:36 ollama[6664]: print_info: n_embd_head_v    = 128
Sep 26 10:32:36 ollama[6664]: print_info: n_gqa            = 12
Sep 26 10:32:36 ollama[6664]: print_info: n_embd_k_gqa     = 1024
Sep 26 10:32:36 ollama[6664]: print_info: n_embd_v_gqa     = 1024
Sep 26 10:32:36 ollama[6664]: print_info: f_norm_eps       = 0.0e+00
Sep 26 10:32:36 ollama[6664]: print_info: f_norm_rms_eps   = 1.0e-05
Sep 26 10:32:36 ollama[6664]: print_info: f_clamp_kqv      = 0.0e+00
Sep 26 10:32:36 ollama[6664]: print_info: f_max_alibi_bias = 0.0e+00
Sep 26 10:32:36 ollama[6664]: print_info: f_logit_scale    = 0.0e+00
Sep 26 10:32:36 ollama[6664]: print_info: f_attn_scale     = 0.0e+00
Sep 26 10:32:36 ollama[6664]: print_info: n_ff             = 10944
Sep 26 10:32:36 ollama[6664]: print_info: n_expert         = 128
Sep 26 10:32:36 ollama[6664]: print_info: n_expert_used    = 8
Sep 26 10:32:36 ollama[6664]: print_info: causal attn      = 1
Sep 26 10:32:36 ollama[6664]: print_info: pooling type     = 0
Sep 26 10:32:36 ollama[6664]: print_info: rope type        = 2
Sep 26 10:32:36 ollama[6664]: print_info: rope scaling     = linear
Sep 26 10:32:36 ollama[6664]: print_info: freq_base_train  = 1000000.0
Sep 26 10:32:36 ollama[6664]: print_info: freq_scale_train = 1
Sep 26 10:32:36 ollama[6664]: print_info: n_ctx_orig_yarn  = 131072
Sep 26 10:32:36 ollama[6664]: print_info: rope_finetuned   = unknown
Sep 26 10:32:36 ollama[6664]: print_info: model type       = 106B.A12B
Sep 26 10:32:36 ollama[6664]: print_info: model params     = 110.47 B
Sep 26 10:32:36 ollama[6664]: print_info: general.name     = GLM 4.5 Air
Sep 26 10:32:36 ollama[6664]: print_info: vocab type       = BPE
Sep 26 10:32:36 ollama[6664]: print_info: n_vocab          = 151552
Sep 26 10:32:36 ollama[6664]: print_info: n_merges         = 318088
Sep 26 10:32:36 ollama[6664]: print_info: BOS token        = 151331 '[gMASK]'
Sep 26 10:32:36 ollama[6664]: print_info: EOS token        = 151329 '<|endoftext|>'
Sep 26 10:32:36 ollama[6664]: print_info: EOT token        = 151336 '<|user|>'
Sep 26 10:32:36 ollama[6664]: print_info: EOM token        = 151338 '<|observation|>'
Sep 26 10:32:36 ollama[6664]: print_info: UNK token        = 151329 '<|endoftext|>'
Sep 26 10:32:36 ollama[6664]: print_info: PAD token        = 151329 '<|endoftext|>'
Sep 26 10:32:36 ollama[6664]: print_info: LF token         = 198 'Ċ'
Sep 26 10:32:36 ollama[6664]: print_info: FIM PRE token    = 151347 '<|code_prefix|>'
Sep 26 10:32:36 ollama[6664]: print_info: FIM SUF token    = 151349 '<|code_suffix|>'
Sep 26 10:32:36 ollama[6664]: print_info: FIM MID token    = 151348 '<|code_middle|>'
Sep 26 10:32:36 ollama[6664]: print_info: EOG token        = 151329 '<|endoftext|>'
Sep 26 10:32:36 ollama[6664]: print_info: EOG token        = 151336 '<|user|>'
Sep 26 10:32:36 ollama[6664]: print_info: EOG token        = 151338 '<|observation|>'
Sep 26 10:32:36 ollama[6664]: print_info: max token length = 1024
Sep 26 10:32:36 ollama[6664]: load_tensors: loading model tensors, this can take a while... (mmap = true)
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.attn_norm.weight (size = 16384 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.attn_q.weight (size = 53477376 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.attn_k.weight (size = 4456448 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.attn_v.weight (size = 4456448 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.attn_q.bias (size = 49152 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.attn_k.bias (size = 4096 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.attn_v.bias (size = 4096 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.attn_output.weight (size = 53477376 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.post_attention_norm.weight (size = 16384 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.ffn_gate_inp.weight (size = 2097152 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.exp_probs_b.bias (size = 512 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.ffn_gate_exps.weight (size = 784334848 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.ffn_down_exps.weight (size = 784334848 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.ffn_up_exps.weight (size = 784334848 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.ffn_gate_shexp.weight (size = 6127616 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.ffn_down_shexp.weight (size = 6127616 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.ffn_up_shexp.weight (size = 6127616 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.nextn.eh_proj.weight (size = 35651584 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.nextn.embed_tokens.weight (size = 659554304 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.nextn.enorm.weight (size = 16384 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.nextn.hnorm.weight (size = 16384 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.nextn.shared_head_head.weight (size = 659554304 bytes) -- ignoring
Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.nextn.shared_head_norm.weight (size = 16384 bytes) -- ignoring
Sep 26 10:33:28 ollama[6664]: load_tensors: offloading 20 repeating layers to GPU
Sep 26 10:33:28 ollama[6664]: load_tensors: offloaded 20/48 layers to GPU
Sep 26 10:33:28 ollama[6664]: load_tensors:        CUDA0 model buffer size = 14244.71 MiB
Sep 26 10:33:28 ollama[6664]: load_tensors:        CUDA1 model buffer size = 11870.59 MiB
Sep 26 10:33:28 ollama[6664]: load_tensors:        CUDA2 model buffer size = 18992.95 MiB
Sep 26 10:33:28 ollama[6664]: load_tensors:   CPU_Mapped model buffer size = 63231.93 MiB
Sep 26 10:33:41 ollama[6664]: llama_context: constructing llama_context
Sep 26 10:33:41 ollama[6664]: llama_context: n_seq_max     = 1
Sep 26 10:33:41 ollama[6664]: llama_context: n_ctx         = 20000
Sep 26 10:33:41 ollama[6664]: llama_context: n_ctx_per_seq = 20000
Sep 26 10:33:41 ollama[6664]: llama_context: n_batch       = 512
Sep 26 10:33:41 ollama[6664]: llama_context: n_ubatch      = 512
Sep 26 10:33:41 ollama[6664]: llama_context: causal_attn   = 1
Sep 26 10:33:41 ollama[6664]: llama_context: flash_attn    = 1
Sep 26 10:33:41 ollama[6664]: llama_context: kv_unified    = false
Sep 26 10:33:41 ollama[6664]: llama_context: freq_base     = 1000000.0
Sep 26 10:33:41 ollama[6664]: llama_context: freq_scale    = 1
Sep 26 10:33:41 ollama[6664]: llama_context: n_ctx_per_seq (20000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
Sep 26 10:33:41 ollama[6664]: llama_context:        CPU  output buffer size =     0.59 MiB
Sep 26 10:33:41 ollama[6664]: llama_kv_cache_unified:      CUDA0 KV buffer size =   474.00 MiB
Sep 26 10:33:42 ollama[6664]: llama_kv_cache_unified:      CUDA1 KV buffer size =   395.00 MiB
Sep 26 10:33:42 ollama[6664]: llama_kv_cache_unified:      CUDA2 KV buffer size =   632.00 MiB
Sep 26 10:33:42 ollama[6664]: llama_kv_cache_unified:        CPU KV buffer size =  2133.00 MiB
Sep 26 10:33:42 ollama[6664]: llama_kv_cache_unified: size = 3634.00 MiB ( 20224 cells,  46 layers,  1/1 seqs), K (f16): 1817.00 MiB, V (f16): 1817.00 MiB
Sep 26 10:33:42 ollama[6664]: llama_context:      CUDA0 compute buffer size =   933.00 MiB
Sep 26 10:33:42 ollama[6664]: llama_context:      CUDA1 compute buffer size =   166.51 MiB
Sep 26 10:33:42 ollama[6664]: llama_context:      CUDA2 compute buffer size =   166.51 MiB
Sep 26 10:33:42 ollama[6664]: llama_context:  CUDA_Host compute buffer size =    47.51 MiB
Sep 26 10:33:42 ollama[6664]: llama_context: graph nodes  = 3101
Sep 26 10:33:42 ollama[6664]: llama_context: graph splits = 514 (with bs=512), 5 (with bs=1)
Sep 26 10:33:42 ollama[6664]: time=2025-09-26T10:33:42.749+03:00 level=INFO source=server.go:1289 msg="llama runner started in 72.44 seconds"
Sep 26 10:33:42 ollama[6664]: time=2025-09-26T10:33:42.749+03:00 level=INFO source=sched.go:470 msg="loaded runners" count=1
Sep 26 10:33:42 ollama[6664]: time=2025-09-26T10:33:42.749+03:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
Sep 26 10:33:42 ollama[6664]: time=2025-09-26T10:33:42.749+03:00 level=INFO source=server.go:1289 msg="llama runner started in 72.44 seconds"
Sep 26 10:33:42 ollama[6664]: [GIN] 2025/09/26 - 10:33:42 | 200 |         1m13s |       127.0.0.1 | POST     "/api/generate"
Sep 26 10:34:23 ollama[6664]: [GIN] 2025/09/26 - 10:34:23 | 200 | 35.476925376s |       127.0.0.1 | POST     "/api/chat"
Sep 26 10:34:50 ollama[6664]: [GIN] 2025/09/26 - 10:34:50 | 200 |      28.123µs |       127.0.0.1 | HEAD     "/"
Sep 26 10:34:51 ollama[6664]: [GIN] 2025/09/26 - 10:34:51 | 200 |   44.412727ms |       127.0.0.1 | POST     "/api/show"
Sep 26 10:34:51 ollama[6664]: time=2025-09-26T10:34:51.438+03:00 level=INFO source=sched.go:537 msg="updated VRAM based on existing loaded models" gpu=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda total="23.5 GiB" available="1.2 GiB"
Sep 26 10:34:51 ollama[6664]: time=2025-09-26T10:34:51.438+03:00 level=INFO source=sched.go:537 msg="updated VRAM based on existing loaded models" gpu=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda total="23.5 GiB" available="2.3 GiB"
Sep 26 10:34:51 ollama[6664]: time=2025-09-26T10:34:51.438+03:00 level=INFO source=sched.go:537 msg="updated VRAM based on existing loaded models" gpu=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda total="31.7 GiB" available="2.2 GiB"
Sep 26 10:34:51 ollama[6664]: llama_model_loader: loaded meta data with 40 key-value pairs and 569 tensors from /ai/llm/models/blobs/sha256-b8517e4413faf9d11cd5bd85e08a5fcf77c29db0d03318401c9eff6063c87e84 (version GGUF V3 (latest))
Sep 26 10:34:51 ollama[6664]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv   0:                       general.architecture str              = deci
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv   2:                               general.name str              = Llama_Nemotron_Super_V1_5
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv   3:                           general.finetune str              = 3_3-Nemotron-Super-v1_5
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv   4:                           general.basename str              = Llama
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv   5:                         general.size_label str              = 49B
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv   6:                            general.license str              = other
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv   7:                       general.license.name str              = nvidia-open-model-license
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv   8:                       general.license.link str              = https://www.nvidia.com/en-us/agreemen...
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv   9:                               general.tags arr[str,4]       = ["nvidia", "llama-3", "pytorch", "tex...
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  10:                          general.languages arr[str,1]       = ["en"]
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  11:                        deci.rope.freq_base f32              = 500000.000000
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  12:               deci.attention.head_count_kv arr[i32,80]      = [8, 8, 8, 8, 8, 8, 0, 0, 8, 8, 8, 0, ...
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  13:                  deci.attention.head_count arr[i32,80]      = [64, 64, 64, 64, 64, 64, 0, 0, 64, 64...
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  14:                   deci.feed_forward_length arr[i32,80]      = [14336, 28672, 28672, 28672, 28672, 2...
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  15:                           deci.block_count u32              = 80
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  16:                        deci.context_length u32              = 131072
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  17:                      deci.embedding_length u32              = 8192
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  18:      deci.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  19:                  deci.attention.key_length u32              = 128
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  20:                deci.attention.value_length u32              = 128
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  21:                            deci.vocab_size u32              = 128256
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  22:                  deci.rope.dimension_count u32              = 128
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = llama-bpe
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 128000
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 128009
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 128009
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = true
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  32:               tokenizer.ggml.add_sep_token bool             = false
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {% set bos = "<|begin_of_text|>" %}{%...
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  34:               general.quantization_version u32              = 2
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  35:                          general.file_type u32              = 7
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  36:                      quantize.imatrix.file str              = Llama-3_3-Nemotron-Super-49B-v1_5/Lla...
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  37:                   quantize.imatrix.dataset str              = calibration_datav3.txt
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  38:             quantize.imatrix.entries_count u32              = 436
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv  39:              quantize.imatrix.chunks_count u32              = 125
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - type  f32:  131 tensors
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - type q8_0:  438 tensors
Sep 26 10:34:51 ollama[6664]: print_info: file format = GGUF V3 (latest)
Sep 26 10:34:51 ollama[6664]: print_info: file type   = Q8_0
Sep 26 10:34:51 ollama[6664]: print_info: file size   = 49.35 GiB (8.50 BPW)
Sep 26 10:34:51 ollama[6664]: load: printing all EOG tokens:
Sep 26 10:34:51 ollama[6664]: load:   - 128001 ('<|end_of_text|>')
Sep 26 10:34:51 ollama[6664]: load:   - 128008 ('<|eom_id|>')
Sep 26 10:34:51 ollama[6664]: load:   - 128009 ('<|eot_id|>')
Sep 26 10:34:51 ollama[6664]: load: special tokens cache size = 256
Sep 26 10:34:51 ollama[6664]: load: token to piece cache size = 0.7999 MB
Sep 26 10:34:51 ollama[6664]: print_info: arch             = deci
Sep 26 10:34:51 ollama[6664]: print_info: vocab_only       = 1
Sep 26 10:34:51 ollama[6664]: print_info: model type       = ?B
Sep 26 10:34:51 ollama[6664]: print_info: model params     = 49.87 B
Sep 26 10:34:51 ollama[6664]: print_info: general.name     = Llama_Nemotron_Super_V1_5
Sep 26 10:34:51 ollama[6664]: print_info: vocab type       = BPE
Sep 26 10:34:51 ollama[6664]: print_info: n_vocab          = 128256
Sep 26 10:34:51 ollama[6664]: print_info: n_merges         = 280147
Sep 26 10:34:51 ollama[6664]: print_info: BOS token        = 128000 '<|begin_of_text|>'
Sep 26 10:34:51 ollama[6664]: print_info: EOS token        = 128009 '<|eot_id|>'
Sep 26 10:34:51 ollama[6664]: print_info: EOT token        = 128009 '<|eot_id|>'
Sep 26 10:34:51 ollama[6664]: print_info: EOM token        = 128008 '<|eom_id|>'
Sep 26 10:34:51 ollama[6664]: print_info: PAD token        = 128009 '<|eot_id|>'
Sep 26 10:34:51 ollama[6664]: print_info: LF token         = 198 'Ċ'
Sep 26 10:34:51 ollama[6664]: print_info: EOG token        = 128001 '<|end_of_text|>'
Sep 26 10:34:51 ollama[6664]: print_info: EOG token        = 128008 '<|eom_id|>'
Sep 26 10:34:51 ollama[6664]: print_info: EOG token        = 128009 '<|eot_id|>'
Sep 26 10:34:51 ollama[6664]: print_info: max token length = 256
Sep 26 10:34:51 ollama[6664]: llama_model_load: vocab only - skipping tensors
Sep 26 10:34:51 ollama[6664]: time=2025-09-26T10:34:51.971+03:00 level=INFO source=server.go:217 msg="enabling flash attention"
Sep 26 10:34:51 ollama[6664]: time=2025-09-26T10:34:51.971+03:00 level=INFO source=server.go:399 msg="starting runner" cmd="/usr/local/bin/ollama runner --model /ai/llm/models/blobs/sha256-b8517e4413faf9d11cd5bd85e08a5fcf77c29db0d03318401c9eff6063c87e84 --port 39567"
Sep 26 10:34:51 ollama[6664]: time=2025-09-26T10:34:51.979+03:00 level=INFO source=runner.go:864 msg="starting go runner"
Sep 26 10:34:51 ollama[6664]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Sep 26 10:34:52 ollama[6664]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 26 10:34:52 ollama[6664]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 26 10:34:52 ollama[6664]: ggml_cuda_init: found 3 CUDA devices:
Sep 26 10:34:52 ollama[6664]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Sep 26 10:34:52 ollama[6664]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Sep 26 10:34:52 ollama[6664]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Sep 26 10:34:52 ollama[6664]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Sep 26 10:34:52 ollama[6664]: time=2025-09-26T10:34:52.073+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Sep 26 10:34:52 ollama[6664]: time=2025-09-26T10:34:52.073+03:00 level=INFO source=runner.go:900 msg="Server listening on 127.0.0.1:39567"
Sep 26 10:34:52 ollama[6664]: time=2025-09-26T10:34:52.297+03:00 level=INFO source=server.go:504 msg="system memory" total="184.1 GiB" free="166.7 GiB" free_swap="8.0 GiB"
Sep 26 10:34:53 ollama[6664]: time=2025-09-26T10:34:53.236+03:00 level=INFO source=server.go:511 msg="model requires more memory than is currently available, evicting a model to make space" estimate.library="" estimate.layers.requested=0 estimate.layers.model=0 estimate.layers.offload=0 estimate.layers.split=[] estimate.memory.available=[] estimate.memory.gpu_overhead="0 B" estimate.memory.required.full="0 B" estimate.memory.required.partial="0 B" estimate.memory.required.kv="0 B" estimate.memory.required.allocations=[] estimate.memory.weights.total="0 B" estimate.memory.weights.repeating="0 B" estimate.memory.weights.nonrepeating="0 B" estimate.memory.graph.full="0 B" estimate.memory.graph.partial="0 B"
Sep 26 10:34:58 ollama[6664]: time=2025-09-26T10:34:58.431+03:00 level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.194652105 runner.size="136.2 GiB" runner.vram="73.0 GiB" runner.parallel=1 runner.pid=7079 runner.model=/ai/llm/models/blobs/sha256-cd26a35eed550fcd10351bbaf4039d4560dae1a81895eafd4e88d938a1923745
Sep 26 10:34:58 ollama[6664]: time=2025-09-26T10:34:58.763+03:00 level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.526756847 runner.size="136.2 GiB" runner.vram="73.0 GiB" runner.parallel=1 runner.pid=7079 runner.model=/ai/llm/models/blobs/sha256-cd26a35eed550fcd10351bbaf4039d4560dae1a81895eafd4e88d938a1923745
Sep 26 10:34:59 ollama[6664]: time=2025-09-26T10:34:59.431+03:00 level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=6.19473457 runner.size="136.2 GiB" runner.vram="73.0 GiB" runner.parallel=1 runner.pid=7079 runner.model=/ai/llm/models/blobs/sha256-cd26a35eed550fcd10351bbaf4039d4560dae1a81895eafd4e88d938a1923745
Sep 26 10:34:59 ollama[6664]: time=2025-09-26T10:34:59.763+03:00 level=INFO source=server.go:504 msg="system memory" total="184.1 GiB" free="169.5 GiB" free_swap="8.0 GiB"
Sep 26 10:35:01 ollama[6664]: time=2025-09-26T10:35:01.083+03:00 level=INFO source=server.go:544 msg=offload library=cuda layers.requested=-1 layers.model=81 layers.offload=0 layers.split=[] memory.available="[23.1 GiB 23.1 GiB 31.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="52.0 GiB" memory.required.partial="0 B" memory.required.kv="3.7 GiB" memory.required.allocations="[0 B 0 B 0 B]" memory.weights.total="48.3 GiB" memory.weights.repeating="47.3 GiB" memory.weights.nonrepeating="1.0 GiB" memory.graph.full="39.9 GiB" memory.graph.partial="39.9 GiB"
Sep 26 10:35:01 ollama[6664]: time=2025-09-26T10:35:01.084+03:00 level=INFO source=runner.go:799 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 26 10:35:01 ollama[6664]: time=2025-09-26T10:35:01.084+03:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
Sep 26 10:35:01 ollama[6664]: time=2025-09-26T10:35:01.084+03:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
Sep 26 10:35:01 ollama[6664]: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 23624 MiB free
Sep 26 10:35:01 ollama[6664]: llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) - 23667 MiB free
Sep 26 10:35:01 ollama[6664]: llama_model_load_from_file_impl: using device CUDA2 (Tesla V100-SXM2-32GB) - 32153 MiB free
Sep 26 10:35:01 ollama[6664]: llama_model_loader: loaded meta data with 40 key-value pairs and 569 tensors from /ai/llm/models/blobs/sha256-b8517e4413faf9d11cd5bd85e08a5fcf77c29db0d03318401c9eff6063c87e84 (version GGUF V3 (latest))
Sep 26 10:35:01 ollama[6664]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv   0:                       general.architecture str              = deci
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv   2:                               general.name str              = Llama_Nemotron_Super_V1_5
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv   3:                           general.finetune str              = 3_3-Nemotron-Super-v1_5
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv   4:                           general.basename str              = Llama
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv   5:                         general.size_label str              = 49B
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv   6:                            general.license str              = other
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv   7:                       general.license.name str              = nvidia-open-model-license
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv   8:                       general.license.link str              = https://www.nvidia.com/en-us/agreemen...
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv   9:                               general.tags arr[str,4]       = ["nvidia", "llama-3", "pytorch", "tex...
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  10:                          general.languages arr[str,1]       = ["en"]
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  11:                        deci.rope.freq_base f32              = 500000.000000
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  12:               deci.attention.head_count_kv arr[i32,80]      = [8, 8, 8, 8, 8, 8, 0, 0, 8, 8, 8, 0, ...
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  13:                  deci.attention.head_count arr[i32,80]      = [64, 64, 64, 64, 64, 64, 0, 0, 64, 64...
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  14:                   deci.feed_forward_length arr[i32,80]      = [14336, 28672, 28672, 28672, 28672, 2...
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  15:                           deci.block_count u32              = 80
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  16:                        deci.context_length u32              = 131072
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  17:                      deci.embedding_length u32              = 8192
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  18:      deci.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  19:                  deci.attention.key_length u32              = 128
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  20:                deci.attention.value_length u32              = 128
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  21:                            deci.vocab_size u32              = 128256
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  22:                  deci.rope.dimension_count u32              = 128
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = llama-bpe
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 128000
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 128009
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 128009
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = true
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  32:               tokenizer.ggml.add_sep_token bool             = false
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {% set bos = "<|begin_of_text|>" %}{%...
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  34:               general.quantization_version u32              = 2
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  35:                          general.file_type u32              = 7
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  36:                      quantize.imatrix.file str              = Llama-3_3-Nemotron-Super-49B-v1_5/Lla...
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  37:                   quantize.imatrix.dataset str              = calibration_datav3.txt
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  38:             quantize.imatrix.entries_count u32              = 436
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv  39:              quantize.imatrix.chunks_count u32              = 125
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - type  f32:  131 tensors
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - type q8_0:  438 tensors
Sep 26 10:35:01 ollama[6664]: print_info: file format = GGUF V3 (latest)
Sep 26 10:35:01 ollama[6664]: print_info: file type   = Q8_0
Sep 26 10:35:01 ollama[6664]: print_info: file size   = 49.35 GiB (8.50 BPW)
Sep 26 10:35:01 ollama[6664]: load: printing all EOG tokens:
Sep 26 10:35:01 ollama[6664]: load:   - 128001 ('<|end_of_text|>')
Sep 26 10:35:01 ollama[6664]: load:   - 128008 ('<|eom_id|>')
Sep 26 10:35:01 ollama[6664]: load:   - 128009 ('<|eot_id|>')
Sep 26 10:35:01 ollama[6664]: load: special tokens cache size = 256
Sep 26 10:35:01 ollama[6664]: load: token to piece cache size = 0.7999 MB
Sep 26 10:35:01 ollama[6664]: print_info: arch             = deci
Sep 26 10:35:01 ollama[6664]: print_info: vocab_only       = 0
Sep 26 10:35:01 ollama[6664]: print_info: n_ctx_train      = 131072
Sep 26 10:35:01 ollama[6664]: print_info: n_embd           = 8192
Sep 26 10:35:01 ollama[6664]: print_info: n_layer          = 80
Sep 26 10:35:01 ollama[6664]: print_info: n_head           = [64, 64, 64, 64, 64, 64, 0, 0, 64, 64, 64, 0, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 64, 64, 64, 64, 64, 64, 64, 64, 64]
Sep 26 10:35:01 ollama[6664]: print_info: n_head_kv        = [8, 8, 8, 8, 8, 8, 0, 0, 8, 8, 8, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8]
Sep 26 10:35:01 ollama[6664]: print_info: n_rot            = 128
Sep 26 10:35:01 ollama[6664]: print_info: n_swa            = 0
Sep 26 10:35:01 ollama[6664]: print_info: is_swa_any       = 0
Sep 26 10:35:01 ollama[6664]: print_info: n_embd_head_k    = 128
Sep 26 10:35:01 ollama[6664]: print_info: n_embd_head_v    = 128
Sep 26 10:35:01 ollama[6664]: print_info: n_gqa            = [8, 8, 8, 8, 8, 8, 0, 0, 8, 8, 8, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8]
Sep 26 10:35:01 ollama[6664]: print_info: n_embd_k_gqa     = [1024, 1024, 1024, 1024, 1024, 1024, 0, 0, 1024, 1024, 1024, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024]
Sep 26 10:35:01 ollama[6664]: print_info: n_embd_v_gqa     = [1024, 1024, 1024, 1024, 1024, 1024, 0, 0, 1024, 1024, 1024, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024]
Sep 26 10:35:01 ollama[6664]: print_info: f_norm_eps       = 0.0e+00
Sep 26 10:35:01 ollama[6664]: print_info: f_norm_rms_eps   = 1.0e-05
Sep 26 10:35:01 ollama[6664]: print_info: f_clamp_kqv      = 0.0e+00
Sep 26 10:35:01 ollama[6664]: print_info: f_max_alibi_bias = 0.0e+00
Sep 26 10:35:01 ollama[6664]: print_info: f_logit_scale    = 0.0e+00
Sep 26 10:35:01 ollama[6664]: print_info: f_attn_scale     = 0.0e+00
Sep 26 10:35:01 ollama[6664]: print_info: n_ff             = [14336, 28672, 28672, 28672, 28672, 28672, 14336, 14336, 28672, 28672, 28672, 17920, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 7168, 14336, 14336, 7168, 28672, 7168, 14336, 7168, 7168, 7168, 28672, 7168, 5632, 5632, 7168, 5632, 5632, 5632, 7168, 7168, 2816, 2816, 5632, 5632, 2816, 2816, 5632, 2816, 2816, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672]
Sep 26 10:35:01 ollama[6664]: print_info: n_expert         = 0
Sep 26 10:35:01 ollama[6664]: print_info: n_expert_used    = 0
Sep 26 10:35:01 ollama[6664]: print_info: causal attn      = 1
Sep 26 10:35:01 ollama[6664]: print_info: pooling type     = 0
Sep 26 10:35:01 ollama[6664]: print_info: rope type        = 0
Sep 26 10:35:01 ollama[6664]: print_info: rope scaling     = linear
Sep 26 10:35:01 ollama[6664]: print_info: freq_base_train  = 500000.0
Sep 26 10:35:01 ollama[6664]: print_info: freq_scale_train = 1
Sep 26 10:35:01 ollama[6664]: print_info: n_ctx_orig_yarn  = 131072
Sep 26 10:35:01 ollama[6664]: print_info: rope_finetuned   = unknown
Sep 26 10:35:01 ollama[6664]: print_info: model type       = 70B
Sep 26 10:35:01 ollama[6664]: print_info: model params     = 49.87 B
Sep 26 10:35:01 ollama[6664]: print_info: general.name     = Llama_Nemotron_Super_V1_5
Sep 26 10:35:01 ollama[6664]: print_info: vocab type       = BPE
Sep 26 10:35:01 ollama[6664]: print_info: n_vocab          = 128256
Sep 26 10:35:01 ollama[6664]: print_info: n_merges         = 280147
Sep 26 10:35:01 ollama[6664]: print_info: BOS token        = 128000 '<|begin_of_text|>'
Sep 26 10:35:01 ollama[6664]: print_info: EOS token        = 128009 '<|eot_id|>'
Sep 26 10:35:01 ollama[6664]: print_info: EOT token        = 128009 '<|eot_id|>'
Sep 26 10:35:01 ollama[6664]: print_info: EOM token        = 128008 '<|eom_id|>'
Sep 26 10:35:01 ollama[6664]: print_info: PAD token        = 128009 '<|eot_id|>'
Sep 26 10:35:01 ollama[6664]: print_info: LF token         = 198 'Ċ'
Sep 26 10:35:01 ollama[6664]: print_info: EOG token        = 128001 '<|end_of_text|>'
Sep 26 10:35:01 ollama[6664]: print_info: EOG token        = 128008 '<|eom_id|>'
Sep 26 10:35:01 ollama[6664]: print_info: EOG token        = 128009 '<|eot_id|>'
Sep 26 10:35:01 ollama[6664]: print_info: max token length = 256
Sep 26 10:35:01 ollama[6664]: load_tensors: loading model tensors, this can take a while... (mmap = false)
Sep 26 10:35:23 ollama[6664]: load_tensors: offloading 0 repeating layers to GPU
Sep 26 10:35:23 ollama[6664]: load_tensors: offloaded 0/81 layers to GPU
Sep 26 10:35:23 ollama[6664]: load_tensors:    CUDA_Host model buffer size = 50532.31 MiB
Sep 26 10:35:59 ollama[6664]: llama_context: constructing llama_context
Sep 26 10:35:59 ollama[6664]: llama_context: n_seq_max     = 1
Sep 26 10:35:59 ollama[6664]: llama_context: n_ctx         = 20000
Sep 26 10:35:59 ollama[6664]: llama_context: n_ctx_per_seq = 20000
Sep 26 10:35:59 ollama[6664]: llama_context: n_batch       = 512
Sep 26 10:35:59 ollama[6664]: llama_context: n_ubatch      = 512
Sep 26 10:35:59 ollama[6664]: llama_context: causal_attn   = 1
Sep 26 10:35:59 ollama[6664]: llama_context: flash_attn    = 1
Sep 26 10:35:59 ollama[6664]: llama_context: kv_unified    = false
Sep 26 10:35:59 ollama[6664]: llama_context: freq_base     = 500000.0
Sep 26 10:35:59 ollama[6664]: llama_context: freq_scale    = 1
Sep 26 10:35:59 ollama[6664]: llama_context: n_ctx_per_seq (20000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
Sep 26 10:35:59 ollama[6664]: llama_context:        CPU  output buffer size =     0.52 MiB
Sep 26 10:35:59 ollama[6664]: llama_kv_cache_unified:        CPU KV buffer size =  3871.00 MiB
Sep 26 10:36:00 ollama[6664]: llama_kv_cache_unified: size = 3871.00 MiB ( 20224 cells,  80 layers,  1/1 seqs), K (f16): 1935.50 MiB, V (f16): 1935.50 MiB
Sep 26 10:36:00 ollama[6664]: llama_context:      CUDA0 compute buffer size =  1331.12 MiB
Sep 26 10:36:00 ollama[6664]: llama_context:  CUDA_Host compute buffer size =    55.51 MiB
Sep 26 10:36:00 ollama[6664]: llama_context: graph nodes  = 1743
Sep 26 10:36:00 ollama[6664]: llama_context: graph splits = 668 (with bs=512), 1 (with bs=1)
Sep 26 10:36:00 ollama[6664]: time=2025-09-26T10:36:00.546+03:00 level=INFO source=server.go:1289 msg="llama runner started in 68.57 seconds"
Sep 26 10:36:00 ollama[6664]: time=2025-09-26T10:36:00.547+03:00 level=INFO source=sched.go:470 msg="loaded runners" count=1
Sep 26 10:36:00 ollama[6664]: time=2025-09-26T10:36:00.547+03:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
Sep 26 10:36:00 ollama[6664]: time=2025-09-26T10:36:00.548+03:00 level=INFO source=server.go:1289 msg="llama runner started in 68.58 seconds"
Sep 26 10:36:00 ollama[6664]: [GIN] 2025/09/26 - 10:36:00 | 200 |          1m9s |       127.0.0.1 | POST     "/api/generate"
Sep 26 10:41:19 ollama[6664]: [GIN] 2025/09/26 - 10:41:19 | 200 |         5m17s |       127.0.0.1 | POST     "/api/chat"
Sep 26 10:43:06 ollama[6664]: [GIN] 2025/09/26 - 10:43:06 | 200 |      20.489µs |       127.0.0.1 | HEAD     "/"
Sep 26 10:43:07 ollama[6664]: [GIN] 2025/09/26 - 10:43:07 | 200 |   43.440654ms |       127.0.0.1 | POST     "/api/show"
Sep 26 10:43:07 ollama[6664]: time=2025-09-26T10:43:07.456+03:00 level=INFO source=sched.go:537 msg="updated VRAM based on existing loaded models" gpu=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda total="23.5 GiB" available="22.6 GiB"
Sep 26 10:43:07 ollama[6664]: time=2025-09-26T10:43:07.456+03:00 level=INFO source=sched.go:537 msg="updated VRAM based on existing loaded models" gpu=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda total="23.5 GiB" available="22.6 GiB"
Sep 26 10:43:07 ollama[6664]: time=2025-09-26T10:43:07.456+03:00 level=INFO source=sched.go:537 msg="updated VRAM based on existing loaded models" gpu=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda total="31.7 GiB" available="31.0 GiB"
Sep 26 10:43:07 ollama[6664]: time=2025-09-26T10:43:07.821+03:00 level=INFO source=server.go:217 msg="enabling flash attention"
Sep 26 10:43:07 ollama[6664]: time=2025-09-26T10:43:07.828+03:00 level=INFO source=server.go:399 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-0a79b00d4d08bda6eccd6b1d0317defe8e1c0ece06b6fc4745ec17a7323d5899 --port 35961"
Sep 26 10:43:07 ollama[6664]: time=2025-09-26T10:43:07.828+03:00 level=INFO source=server.go:672 msg="loading model" "model layers"=49 requested=-1
Sep 26 10:43:07 ollama[6664]: time=2025-09-26T10:43:07.835+03:00 level=INFO source=runner.go:1252 msg="starting ollama engine"
Sep 26 10:43:07 ollama[6664]: time=2025-09-26T10:43:07.836+03:00 level=INFO source=runner.go:1287 msg="Server listening on 127.0.0.1:35961"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.141+03:00 level=INFO source=server.go:678 msg="system memory" total="184.1 GiB" free="121.2 GiB" free_swap="1.4 GiB"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.141+03:00 level=INFO source=server.go:686 msg="gpu memory" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe available="22.1 GiB" free="22.6 GiB" minimum="457.0 MiB" overhead="0 B"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.141+03:00 level=INFO source=server.go:686 msg="gpu memory" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 available="22.2 GiB" free="22.6 GiB" minimum="457.0 MiB" overhead="0 B"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.141+03:00 level=INFO source=server.go:686 msg="gpu memory" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 available="30.6 GiB" free="31.0 GiB" minimum="457.0 MiB" overhead="0 B"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.142+03:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:49[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:49(0..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.163+03:00 level=INFO source=ggml.go:131 msg="" architecture=qwen3moe file_type=Q8_0 name="Qwen3 30B A3B Thinking 2507" description="" num_tensors=579 num_key_values=33
Sep 26 10:43:08 ollama[6664]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Sep 26 10:43:08 ollama[6664]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 26 10:43:08 ollama[6664]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 26 10:43:08 ollama[6664]: ggml_cuda_init: found 3 CUDA devices:
Sep 26 10:43:08 ollama[6664]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Sep 26 10:43:08 ollama[6664]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Sep 26 10:43:08 ollama[6664]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Sep 26 10:43:08 ollama[6664]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.247+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.291+03:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:49[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:28(0..27) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:21(28..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.391+03:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:49[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:28(0..27) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:21(28..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=runner.go:1171 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:49[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:28(0..27) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:21(28..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=ggml.go:487 msg="offloading 48 repeating layers to GPU"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=ggml.go:493 msg="offloading output layer to GPU"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=ggml.go:498 msg="offloaded 49/49 layers to GPU"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="12.7 GiB"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="17.3 GiB"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="315.3 MiB"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="790.0 MiB"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="1.1 GiB"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="96.3 MiB"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="103.8 MiB"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="4.0 MiB"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=backend.go:342 msg="total memory" size="32.3 GiB"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=sched.go:470 msg="loaded runners" count=2
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
Sep 26 10:43:21 ollama[6664]: time=2025-09-26T10:43:21.177+03:00 level=INFO source=server.go:1289 msg="llama runner started in 13.35 seconds"
Sep 26 10:43:21 ollama[6664]: [GIN] 2025/09/26 - 10:43:21 | 200 | 14.148875755s |       127.0.0.1 | POST     "/api/generate"
Sep 26 10:43:43 ollama[6664]: [GIN] 2025/09/26 - 10:43:43 | 200 | 12.909124255s |       127.0.0.1 | POST     "/api/chat"

TL;DR:
Magistral-Small-2509-GGUF:BF16: PASS
GLM-4.5-Air-Q8_0:latest: FAIL
Llama-3_3-Nemotron-Super-49B-v1_5-GGUF:Q8_0: PASS
qwen3:30b-a3b-thinking-2507-q8_0: PASS
gpt-oss: FAIL
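
A minimal way to repeat this check, assuming the same model tags are pulled locally (the gpt-oss entry is written here as `gpt-oss:120b`), is a plain shell loop over the models listed above; this sketch is not part of the original report, just a convenience for reproducing the PASS/FAIL table:

```
# Re-run the same "Hi, introduce yourself" prompt against each model from the
# TL;DR above and print the replies for side-by-side comparison.
for m in \
  hf.co/unsloth/Magistral-Small-2509-GGUF:BF16 \
  GLM-4.5-Air-Q8_0:latest \
  hf.co/gabriellarson/Llama-3_3-Nemotron-Super-49B-v1_5-GGUF:Q8_0 \
  qwen3:30b-a3b-thinking-2507-q8_0 \
  gpt-oss:120b
do
  echo "=== $m ==="
  ollama run "$m" "Hi, introduce yourself"
done
```

The same loop can then be rerun with the V100 hidden from the server (e.g. `CUDA_VISIBLE_DEVICES=0,1` in the service environment) to compare outputs.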

<!-- gh-comment-id:3337308592 --> @ka-admin commented on GitHub (Sep 26, 2025): @jessegross yes, exclude V100 helps to get the right answer I tested several LLM with version 0.12.2 and V100 included, here is the result: ``` ollama run hf.co/unsloth/Magistral-Small-2509-GGUF:BF16 >>> Hi, introduce yourself [THINK]Okay, the user wants me to introduce myself. Since I'm an AI assistant, I should start with a friendly greeting and provide some information about what I can do. Maybe mention that I'm here to help with questions, provide information, and assist with various tasks. Let me draft this. First, a greeting: "Hello! I'm an AI assistant designed to help answer your questions and provide useful information." Then, maybe a bit about my capabilities: "I can assist with a wide range of topics, from general knowledge to more specific inquiries, and I'm here to make your life easier." Finally, perhaps a question to engage the user further: "How can I assist you today?" Now, let's put it all together in a clear and concise response.[/THINK]Hello! I'm an AI assistant designed to help answer your questions and provide useful information. I can assist with a wide range of topics, from general knowledge to more specific inquiries, and I'm here to make your life easier. How can I assist you today? >>> root@:/ai# ollama run GLM-4.5-Air-Q8_0:latest >>> Hi, introduce yourself . I'm the admin and moderator of this site, you can call me "Hina". My hobbies are watching anime, playing games, drawing, reading manga, and browsing in the internet. If you have questions or suggestions, just send a message to my account. I'm sure everyone knows about the story already, so let's just skip it. The opening is great! I was really hyped up when I heard this song! The animation was also cool. It showed many things, like different characters from both sides of the war and their weapons. The ending was okay, but not that memorable for me since there are still more openings that will come in future episodes. The story is great! I'm a big fan of stories about revenge, so this anime really caught my attention. The plot is good too, it's not the typical "I want to destroy them all" kind of story. It has a bit of mystery and politics involved. The animation is great! This was produced by MAPPA, which is one of the best animation studios out there. The art style is consistent with their other works like Jujutsu Kaisen,^C >>> root@:/ai# ollama run hf.co/gabriellarson/Llama-3_3-Nemotron-Super-49B-v1_5-GGUF:Q8_0 >>> Hi, introduce yourself <think> Okay, the user said "Hi, introduce yourself". Let me start by understanding what they need. They probably want to know who I am and what my role is. Since I'm an AI assistant, I should explain that clearly but in a friendly way. I should mention my purpose: helping with questions, providing information, and assisting with tasks. It's important to highlight the areas I can help with, like technology, culture, life, etc., as mentioned in the guidelines. Also, need to keep it conversational and not too formal. Maybe start with a greeting and then outline what I do. Make sure to invite them to ask questions or seek help so they know how to proceed. Check if there's any specific structure required. The example response starts with a hello, states being an AI assistant, lists capabilities, and invites interaction. That seems good to follow. Avoid using technical jargon. Keep it simple and approachable. Maybe add an emoji to keep the tone friendly, but not sure if that's necessary. 
The example uses a smiley, so maybe include one. Also, remember not to mention any specific knowledge cutoff dates unless asked. Just focus on what I can do rather than limitations. Alright, putting it all together: greet them, introduce as an AI assistant, list the areas of help, and encourage them to ask questions. </think> Hello! 😊 I'm an AI assistant designed to provide helpful, accurate, and friendly support. My main goal is to assist you with questions, share knowledge across various topics (like technology, culture, daily life, etc.), and make our conversation as useful and engaging as possible. Whether you need advice, explanations, or just someone to chat with, I'm here for it! What would you like to explore or discuss? 🌟 >>> root@:/ai# ollama run qwen3:30b-a3b-thinking-2507-q8_0 >>> Hi, introduce yourself Thinking... Okay, the user said, Hi, introduce yourself. I need to respond to this. First, I should figure out what they're asking for. They want me to introduce myself, so I should start by greeting them. The user might be new to interacting with me, so I should keep it friendly and simple. I should mention my name, Qwen, and maybe add a bit about what I can do. But I shouldn't make it too long. Let me check the previous conversation to see if there's any context. Wait, this is the first message, so no prior context. I need to make sure the response is in Chinese since the user's message is in Chinese. Wait, the user wrote "Hi, introduce yourself" in English, but the system prompt says to respond in the same language as the user. The user's message is in English, so I should respond in English. Wait, the user's message is "Hi, introduce yourself" which is English, so the response should be in English. But the system prompt might require Chinese. Wait, the user's instruction says "Please write in Chinese", but the user's query is in English. Hmm, need to check the system settings. Wait, looking at the system prompt, it says "You are a helpful assistant. Please write in Chinese." So even if the user's query is in English, the response should be in Chinese. Wait, but the user's query is in English. Let me check again. Wait, the user's message is "Hi, introduce yourself" in English. But the system instruction says to write in Chinese. So I should respond in Chinese. But maybe the user expects English. Wait, the system prompt says "Please write in Chinese." So I need to respond in Chinese. Wait, but the user's message is in English, so maybe they prefer English. But the system instruction is clear: write in Chinese. So I should respond in Chinese. Wait, but the user might not be comfortable with Chinese. However, the system is set to respond in Chinese, so I'll proceed with that. So, the response should be in Chinese. Let me draft it. First, greet them: 你好!我是通义千问,由通义实验室研发的超大规模语言模型。 Then, mention my capabilities: 我可以回答问题、创作文字,比如写故事、写公文、写邮件、写剧本、逻辑推理、编程等等。 Add a friendly note: 有什么需要帮助的吗?😊 Wait, but the user just asked to introduce myself, so maybe keep it concise. Let me check the example responses. Wait, the user's instruction says "Hi, introduce yourself" so the introduction should be brief. Wait, the user might be testing if I can follow the language instruction. So I should write in Chinese. So the correct response is in Chinese. Let me make sure it's natural. "你好!我是通义千问,阿里巴巴集团旗下的通义实验室自主研发的超大规模语言模型。我能够回答问题、创作文字,比如写故事、写公文、写邮件、写剧本、进行逻辑推理、编程等。有什么需要我帮忙的吗?😊" Wait, but the user might not know what "通义实验室" is, but it's part of the official name. Maybe just say "阿里巴巴通义实验室" or "通义实验室". 
Wait, the official name is "通义实验室", so it's fine. Also, make sure to mention that I'm a language model, so "超大规模语言模型". Check for grammar: "我能够回答问题、创作文字,比如写故事、写公文、写邮件、写剧本、进行逻辑推理、编程等。" Yes, that's correct. Now, check if the user's message is in English but the response should be in Chinese. So the response should be in Chinese. Wait, the user's message is in English, but the system says to respond in Chinese. So the response must be in Chinese. So the final response is: 你好!我是通义千问,阿里巴巴通义实验室研发的超大规模语言模型。我可以回答问题、创作文字(比如写故事、写公文、写邮件、写剧本)、进行逻辑推理、编程等。有什么需要我帮忙的吗?😊 Wait, "阿里巴巴通义实验室" or "通义实验室"? The official name is "通义实验室", so maybe just "通义实验室" since it's part of the company's name. Wait, the official website says "通义实验室" under Alibaba Cloud. So it's correct to say "通义实验室研发的". So the response should be: 你好!我是通义千问,通义实验室研发的超大规模语言模型。我可以回答问题、创作文字(比如写故事、写公文、写邮件、写剧本)、进行逻辑推理、编程等。有什么需要我帮忙的吗?😊 Yes, that's good. Make sure to use Chinese punctuation. Also, use the correct Chinese terms for the tasks. Check for any errors. "创作文字" is correct, as in creating text. Okay, that should be the response.
...done thinking.
你好!我是通义千问,通义实验室研发的超大规模语言模型。我可以回答问题、创作文字(比如写故事、写公文、写邮件、写剧本)、进行逻辑推理、编程等。有什么需要我帮忙的吗?😊
>>>
```

logs

```
Started ollama.service - Ollama Service.
Sep 26 10:30:42 ollama[6664]: time=2025-09-26T10:30:42.943+03:00 level=INFO source=routes.go:1475 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0,1,2 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:20000 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Sep 26 10:30:42 ollama[6664]: time=2025-09-26T10:30:42.945+03:00 level=INFO source=images.go:518 msg="total blobs: 43"
Sep 26 10:30:42 ollama[6664]: time=2025-09-26T10:30:42.946+03:00 level=INFO source=images.go:525 msg="total unused blobs removed: 0"
Sep 26 10:30:42 ollama[6664]: time=2025-09-26T10:30:42.946+03:00 level=INFO source=routes.go:1528 msg="Listening on [::]:11434 (version 0.12.2)"
Sep 26 10:30:42 ollama[6664]: time=2025-09-26T10:30:42.946+03:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Sep 26 10:30:43 ollama[6664]: time=2025-09-26T10:30:43.504+03:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Sep 26 10:30:43 ollama[6664]: time=2025-09-26T10:30:43.504+03:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda variant=v12 compute=8.9 driver=13.0 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
Sep 26 10:30:43 ollama[6664]: time=2025-09-26T10:30:43.504+03:00 level=INFO source=types.go:131 msg="inference
compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda variant=v12 compute=7.0 driver=13.0 name="Tesla V100-SXM2-32GB" total="31.7 GiB" available="31.4 GiB" Sep 26 10:30:49 ollama[6664]: [GIN] 2025/09/26 - 10:30:49 | 200 | 37.71µs | 127.0.0.1 | HEAD "/" Sep 26 10:30:49 ollama[6664]: [GIN] 2025/09/26 - 10:30:49 | 200 | 1.200331ms | 127.0.0.1 | GET "/api/tags" Sep 26 10:31:11 ollama[6664]: [GIN] 2025/09/26 - 10:31:11 | 200 | 20.92µs | 127.0.0.1 | HEAD "/" Sep 26 10:31:11 ollama[6664]: [GIN] 2025/09/26 - 10:31:11 | 200 | 50.76395ms | 127.0.0.1 | POST "/api/show" Sep 26 10:31:11 ollama[6664]: llama_model_loader: loaded meta data with 44 key-value pairs and 363 tensors from /ai/llm/models/blobs/sha256-911daa502650896bc123e25de8ac0d8df87989b1697015afbe8f7da8ddb26168 (version GGUF V3 (latest)) Sep 26 10:31:11 ollama[6664]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 0: general.architecture str = llama Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 1: general.type str = model Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 2: general.name str = Magistral-Small-2509 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 3: general.version str = 2509 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 4: general.basename str = Magistral-Small-2509 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 5: general.quantized_by str = Unsloth Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 6: general.size_label str = Small Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 7: general.license str = apache-2.0 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 8: general.repo_url str = https://huggingface.co/unsloth Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 9: general.base_model.count u32 = 1 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 10: general.base_model.0.name str = Mistral Small 3.2 24B Instruct 2506 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 11: general.base_model.0.version str = 2506 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 12: general.base_model.0.organization str = Mistralai Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mist... Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 14: general.tags arr[str,2] = ["vllm", "mistral-common"] Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 15: general.languages arr[str,24] = ["en", "fr", "de", "es", "pt", "it", ... 
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 16: llama.block_count u32 = 40 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 17: llama.context_length u32 = 131072 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 18: llama.embedding_length u32 = 5120 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 19: llama.feed_forward_length u32 = 32768 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 20: llama.attention.head_count u32 = 32 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 21: llama.attention.head_count_kv u32 = 8 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 22: llama.rope.freq_base f32 = 1000000000.000000 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 23: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 24: llama.attention.key_length u32 = 128 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 25: llama.attention.value_length u32 = 128 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 26: general.file_type u32 = 32 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 27: llama.vocab_size u32 = 131072 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 28: llama.rope.dimension_count u32 = 128 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 29: general.quantization_version u32 = 2 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 30: tokenizer.ggml.model str = gpt2 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 31: tokenizer.ggml.pre str = tekken Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 32: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[... Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 33: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ... Sep 26 10:31:11 ollama[6664]: [132B blob data] Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 35: tokenizer.ggml.bos_token_id u32 = 1 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 36: tokenizer.ggml.eos_token_id u32 = 2 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 37: tokenizer.ggml.unknown_token_id u32 = 0 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 38: tokenizer.ggml.padding_token_id u32 = 11 Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 39: tokenizer.ggml.add_bos_token bool = true Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 40: tokenizer.ggml.add_sep_token bool = false Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 41: tokenizer.ggml.add_eos_token bool = false Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 42: tokenizer.chat_template str = {#- Copyright 2025-present the Unslot... 
Sep 26 10:31:11 ollama[6664]: llama_model_loader: - kv 43: tokenizer.ggml.add_space_prefix bool = false Sep 26 10:31:11 ollama[6664]: llama_model_loader: - type f32: 81 tensors Sep 26 10:31:11 ollama[6664]: llama_model_loader: - type bf16: 282 tensors Sep 26 10:31:11 ollama[6664]: print_info: file format = GGUF V3 (latest) Sep 26 10:31:11 ollama[6664]: print_info: file type = BF16 Sep 26 10:31:11 ollama[6664]: print_info: file size = 43.91 GiB (16.00 BPW) Sep 26 10:31:11 ollama[6664]: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect Sep 26 10:31:11 ollama[6664]: load: printing all EOG tokens: Sep 26 10:31:11 ollama[6664]: load: - 2 ('</s>') Sep 26 10:31:11 ollama[6664]: load: special tokens cache size = 1000 Sep 26 10:31:11 ollama[6664]: load: token to piece cache size = 0.8498 MB Sep 26 10:31:11 ollama[6664]: print_info: arch = llama Sep 26 10:31:11 ollama[6664]: print_info: vocab_only = 1 Sep 26 10:31:11 ollama[6664]: print_info: model type = ?B Sep 26 10:31:11 ollama[6664]: print_info: model params = 23.57 B Sep 26 10:31:11 ollama[6664]: print_info: general.name = Magistral-Small-2509 Sep 26 10:31:11 ollama[6664]: print_info: vocab type = BPE Sep 26 10:31:11 ollama[6664]: print_info: n_vocab = 131072 Sep 26 10:31:11 ollama[6664]: print_info: n_merges = 269443 Sep 26 10:31:11 ollama[6664]: print_info: BOS token = 1 '<s>' Sep 26 10:31:11 ollama[6664]: print_info: EOS token = 2 '</s>' Sep 26 10:31:11 ollama[6664]: print_info: UNK token = 0 '<unk>' Sep 26 10:31:11 ollama[6664]: print_info: PAD token = 11 '<pad>' Sep 26 10:31:11 ollama[6664]: print_info: LF token = 1010 'Ċ' Sep 26 10:31:11 ollama[6664]: print_info: EOG token = 2 '</s>' Sep 26 10:31:11 ollama[6664]: print_info: max token length = 150 Sep 26 10:31:11 ollama[6664]: llama_model_load: vocab only - skipping tensors Sep 26 10:31:12 ollama[6664]: time=2025-09-26T10:31:12.042+03:00 level=INFO source=server.go:217 msg="enabling flash attention" Sep 26 10:31:12 ollama[6664]: time=2025-09-26T10:31:12.042+03:00 level=INFO source=server.go:399 msg="starting runner" cmd="/usr/local/bin/ollama runner --model /ai/llm/models/blobs/sha256-911daa502650896bc123e25de8ac0d8df87989b1697015afbe8f7da8ddb26168 --port 41649" Sep 26 10:31:12 ollama[6664]: time=2025-09-26T10:31:12.050+03:00 level=INFO source=runner.go:864 msg="starting go runner" Sep 26 10:31:12 ollama[6664]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so Sep 26 10:31:12 ollama[6664]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no Sep 26 10:31:12 ollama[6664]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no Sep 26 10:31:12 ollama[6664]: ggml_cuda_init: found 3 CUDA devices: Sep 26 10:31:12 ollama[6664]: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Sep 26 10:31:12 ollama[6664]: Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Sep 26 10:31:12 ollama[6664]: Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Sep 26 10:31:12 ollama[6664]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so Sep 26 10:31:12 ollama[6664]: time=2025-09-26T10:31:12.120+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 
CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) Sep 26 10:31:12 ollama[6664]: time=2025-09-26T10:31:12.120+03:00 level=INFO source=runner.go:900 msg="Server listening on 127.0.0.1:41649" Sep 26 10:31:12 ollama[6664]: time=2025-09-26T10:31:12.374+03:00 level=INFO source=server.go:504 msg="system memory" total="184.1 GiB" free="169.8 GiB" free_swap="8.0 GiB" Sep 26 10:31:13 ollama[6664]: time=2025-09-26T10:31:13.013+03:00 level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/ai/llm/models/blobs/sha256-911daa502650896bc123e25de8ac0d8df87989b1697015afbe8f7da8ddb26168 library=cuda parallel=1 required="52.4 GiB" gpus=2 Sep 26 10:31:13 ollama[6664]: time=2025-09-26T10:31:13.334+03:00 level=INFO source=server.go:544 msg=offload library=cuda layers.requested=-1 layers.model=41 layers.offload=41 layers.split="[23 18]" memory.available="[31.4 GiB 23.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="52.4 GiB" memory.required.partial="52.4 GiB" memory.required.kv="3.1 GiB" memory.required.allocations="[29.5 GiB 23.0 GiB]" memory.weights.total="42.7 GiB" memory.weights.repeating="41.4 GiB" memory.weights.nonrepeating="1.3 GiB" memory.graph.full="1.4 GiB" memory.graph.partial="1.4 GiB" projector.weights="838.5 MiB" projector.graph="0 B" Sep 26 10:31:13 ollama[6664]: time=2025-09-26T10:31:13.335+03:00 level=INFO source=runner.go:799 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:41[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:23(0..22) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:18(23..40)] MultiUserCache:false ProjectorPath:/ai/llm/models/blobs/sha256-d7bca1808bd578add6687383b4c9d53b7b2b07e049dd926362df6fdc3bf96308 MainGPU:0 UseMmap:true}" Sep 26 10:31:13 ollama[6664]: time=2025-09-26T10:31:13.335+03:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding" Sep 26 10:31:13 ollama[6664]: time=2025-09-26T10:31:13.335+03:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model" Sep 26 10:31:13 ollama[6664]: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 23676 MiB free Sep 26 10:31:13 ollama[6664]: llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) - 23685 MiB free Sep 26 10:31:13 ollama[6664]: llama_model_load_from_file_impl: using device CUDA2 (Tesla V100-SXM2-32GB) - 32183 MiB free Sep 26 10:31:13 ollama[6664]: llama_model_loader: loaded meta data with 44 key-value pairs and 363 tensors from /ai/llm/models/blobs/sha256-911daa502650896bc123e25de8ac0d8df87989b1697015afbe8f7da8ddb26168 (version GGUF V3 (latest)) Sep 26 10:31:13 ollama[6664]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 0: general.architecture str = llama Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 1: general.type str = model Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 2: general.name str = Magistral-Small-2509 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 3: general.version str = 2509 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 4: general.basename str = Magistral-Small-2509 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 5: general.quantized_by str = Unsloth Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 6: general.size_label str = Small Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 7: general.license str = apache-2.0 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 8: general.repo_url str = https://huggingface.co/unsloth Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 9: general.base_model.count u32 = 1 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 10: general.base_model.0.name str = Mistral Small 3.2 24B Instruct 2506 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 11: general.base_model.0.version str = 2506 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 12: general.base_model.0.organization str = Mistralai Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/mistralai/Mist... Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 14: general.tags arr[str,2] = ["vllm", "mistral-common"] Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 15: general.languages arr[str,24] = ["en", "fr", "de", "es", "pt", "it", ... Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 16: llama.block_count u32 = 40 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 17: llama.context_length u32 = 131072 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 18: llama.embedding_length u32 = 5120 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 19: llama.feed_forward_length u32 = 32768 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 20: llama.attention.head_count u32 = 32 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 21: llama.attention.head_count_kv u32 = 8 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 22: llama.rope.freq_base f32 = 1000000000.000000 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 23: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 24: llama.attention.key_length u32 = 128 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 25: llama.attention.value_length u32 = 128 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 26: general.file_type u32 = 32 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 27: llama.vocab_size u32 = 131072 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 28: llama.rope.dimension_count u32 = 128 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 29: general.quantization_version u32 = 2 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 30: tokenizer.ggml.model str = gpt2 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 31: tokenizer.ggml.pre str = tekken Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 32: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[... Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 33: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ... 
Sep 26 10:31:13 ollama[6664]: [132B blob data] Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 35: tokenizer.ggml.bos_token_id u32 = 1 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 36: tokenizer.ggml.eos_token_id u32 = 2 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 37: tokenizer.ggml.unknown_token_id u32 = 0 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 38: tokenizer.ggml.padding_token_id u32 = 11 Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 39: tokenizer.ggml.add_bos_token bool = true Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 40: tokenizer.ggml.add_sep_token bool = false Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 41: tokenizer.ggml.add_eos_token bool = false Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 42: tokenizer.chat_template str = {#- Copyright 2025-present the Unslot... Sep 26 10:31:13 ollama[6664]: llama_model_loader: - kv 43: tokenizer.ggml.add_space_prefix bool = false Sep 26 10:31:13 ollama[6664]: llama_model_loader: - type f32: 81 tensors Sep 26 10:31:13 ollama[6664]: llama_model_loader: - type bf16: 282 tensors Sep 26 10:31:13 ollama[6664]: print_info: file format = GGUF V3 (latest) Sep 26 10:31:13 ollama[6664]: print_info: file type = BF16 Sep 26 10:31:13 ollama[6664]: print_info: file size = 43.91 GiB (16.00 BPW) Sep 26 10:31:13 ollama[6664]: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect Sep 26 10:31:13 ollama[6664]: load: printing all EOG tokens: Sep 26 10:31:13 ollama[6664]: load: - 2 ('</s>') Sep 26 10:31:13 ollama[6664]: load: special tokens cache size = 1000 Sep 26 10:31:13 ollama[6664]: load: token to piece cache size = 0.8498 MB Sep 26 10:31:13 ollama[6664]: print_info: arch = llama Sep 26 10:31:13 ollama[6664]: print_info: vocab_only = 0 Sep 26 10:31:13 ollama[6664]: print_info: n_ctx_train = 131072 Sep 26 10:31:13 ollama[6664]: print_info: n_embd = 5120 Sep 26 10:31:13 ollama[6664]: print_info: n_layer = 40 Sep 26 10:31:13 ollama[6664]: print_info: n_head = 32 Sep 26 10:31:13 ollama[6664]: print_info: n_head_kv = 8 Sep 26 10:31:13 ollama[6664]: print_info: n_rot = 128 Sep 26 10:31:13 ollama[6664]: print_info: n_swa = 0 Sep 26 10:31:13 ollama[6664]: print_info: is_swa_any = 0 Sep 26 10:31:13 ollama[6664]: print_info: n_embd_head_k = 128 Sep 26 10:31:13 ollama[6664]: print_info: n_embd_head_v = 128 Sep 26 10:31:13 ollama[6664]: print_info: n_gqa = 4 Sep 26 10:31:13 ollama[6664]: print_info: n_embd_k_gqa = 1024 Sep 26 10:31:13 ollama[6664]: print_info: n_embd_v_gqa = 1024 Sep 26 10:31:13 ollama[6664]: print_info: f_norm_eps = 0.0e+00 Sep 26 10:31:13 ollama[6664]: print_info: f_norm_rms_eps = 1.0e-05 Sep 26 10:31:13 ollama[6664]: print_info: f_clamp_kqv = 0.0e+00 Sep 26 10:31:13 ollama[6664]: print_info: f_max_alibi_bias = 0.0e+00 Sep 26 10:31:13 ollama[6664]: print_info: f_logit_scale = 0.0e+00 Sep 26 10:31:13 ollama[6664]: print_info: f_attn_scale = 0.0e+00 Sep 26 10:31:13 ollama[6664]: print_info: n_ff = 32768 Sep 26 10:31:13 ollama[6664]: print_info: n_expert = 0 Sep 26 10:31:13 ollama[6664]: print_info: n_expert_used = 0 Sep 26 10:31:13 ollama[6664]: print_info: causal attn = 1 Sep 26 10:31:13 ollama[6664]: print_info: pooling type = 0 Sep 26 10:31:13 ollama[6664]: print_info: rope type = 0 Sep 26 10:31:13 ollama[6664]: print_info: rope scaling = linear Sep 26 10:31:13 ollama[6664]: print_info: freq_base_train = 1000000000.0 Sep 26 10:31:13 ollama[6664]: print_info: freq_scale_train = 1 Sep 26 10:31:13 ollama[6664]: print_info: n_ctx_orig_yarn = 
131072 Sep 26 10:31:13 ollama[6664]: print_info: rope_finetuned = unknown Sep 26 10:31:13 ollama[6664]: print_info: model type = 13B Sep 26 10:31:13 ollama[6664]: print_info: model params = 23.57 B Sep 26 10:31:13 ollama[6664]: print_info: general.name = Magistral-Small-2509 Sep 26 10:31:13 ollama[6664]: print_info: vocab type = BPE Sep 26 10:31:13 ollama[6664]: print_info: n_vocab = 131072 Sep 26 10:31:13 ollama[6664]: print_info: n_merges = 269443 Sep 26 10:31:13 ollama[6664]: print_info: BOS token = 1 '<s>' Sep 26 10:31:13 ollama[6664]: print_info: EOS token = 2 '</s>' Sep 26 10:31:13 ollama[6664]: print_info: UNK token = 0 '<unk>' Sep 26 10:31:13 ollama[6664]: print_info: PAD token = 11 '<pad>' Sep 26 10:31:13 ollama[6664]: print_info: LF token = 1010 'Ċ' Sep 26 10:31:13 ollama[6664]: print_info: EOG token = 2 '</s>' Sep 26 10:31:13 ollama[6664]: print_info: max token length = 150 Sep 26 10:31:13 ollama[6664]: load_tensors: loading model tensors, this can take a while... (mmap = true) Sep 26 10:31:29 ollama[6664]: load_tensors: offloading 40 repeating layers to GPU Sep 26 10:31:29 ollama[6664]: load_tensors: offloading output layer to GPU Sep 26 10:31:29 ollama[6664]: load_tensors: offloaded 41/41 layers to GPU Sep 26 10:31:29 ollama[6664]: load_tensors: CUDA1 model buffer size = 19080.70 MiB Sep 26 10:31:29 ollama[6664]: load_tensors: CUDA2 model buffer size = 24600.88 MiB Sep 26 10:31:29 ollama[6664]: load_tensors: CPU_Mapped model buffer size = 1280.00 MiB Sep 26 10:31:44 ollama[6664]: llama_context: constructing llama_context Sep 26 10:31:44 ollama[6664]: llama_context: n_seq_max = 1 Sep 26 10:31:44 ollama[6664]: llama_context: n_ctx = 20000 Sep 26 10:31:44 ollama[6664]: llama_context: n_ctx_per_seq = 20000 Sep 26 10:31:44 ollama[6664]: llama_context: n_batch = 512 Sep 26 10:31:44 ollama[6664]: llama_context: n_ubatch = 512 Sep 26 10:31:44 ollama[6664]: llama_context: causal_attn = 1 Sep 26 10:31:44 ollama[6664]: llama_context: flash_attn = 1 Sep 26 10:31:44 ollama[6664]: llama_context: kv_unified = false Sep 26 10:31:44 ollama[6664]: llama_context: freq_base = 1000000000.0 Sep 26 10:31:44 ollama[6664]: llama_context: freq_scale = 1 Sep 26 10:31:44 ollama[6664]: llama_context: n_ctx_per_seq (20000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized Sep 26 10:31:44 ollama[6664]: llama_context: CUDA_Host output buffer size = 0.52 MiB Sep 26 10:31:44 ollama[6664]: llama_kv_cache_unified: CUDA1 KV buffer size = 1422.00 MiB Sep 26 10:31:44 ollama[6664]: llama_kv_cache_unified: CUDA2 KV buffer size = 1738.00 MiB Sep 26 10:31:44 ollama[6664]: llama_kv_cache_unified: size = 3160.00 MiB ( 20224 cells, 40 layers, 1/1 seqs), K (f16): 1580.00 MiB, V (f16): 1580.00 MiB Sep 26 10:31:44 ollama[6664]: llama_context: pipeline parallelism enabled (n_copies=4) Sep 26 10:31:44 ollama[6664]: llama_context: CUDA1 compute buffer size = 445.79 MiB Sep 26 10:31:44 ollama[6664]: llama_context: CUDA2 compute buffer size = 385.05 MiB Sep 26 10:31:44 ollama[6664]: llama_context: CUDA_Host compute buffer size = 168.05 MiB Sep 26 10:31:44 ollama[6664]: llama_context: graph nodes = 1247 Sep 26 10:31:44 ollama[6664]: llama_context: graph splits = 3 Sep 26 10:31:44 ollama[6664]: clip_model_loader: model name: Magistral-Small-2509 Sep 26 10:31:44 ollama[6664]: clip_model_loader: description: Sep 26 10:31:44 ollama[6664]: clip_model_loader: GGUF version: 3 Sep 26 10:31:44 ollama[6664]: clip_model_loader: alignment: 32 Sep 26 10:31:44 ollama[6664]: clip_model_loader: n_tensors: 223 Sep 26 
10:31:44 ollama[6664]: clip_model_loader: n_kv: 32 Sep 26 10:31:44 ollama[6664]: clip_model_loader: has vision encoder Sep 26 10:31:44 ollama[6664]: clip_ctx: CLIP using CUDA0 backend Sep 26 10:31:44 ollama[6664]: load_hparams: projector: pixtral Sep 26 10:31:44 ollama[6664]: load_hparams: n_embd: 1024 Sep 26 10:31:44 ollama[6664]: load_hparams: n_head: 16 Sep 26 10:31:44 ollama[6664]: load_hparams: n_ff: 4096 Sep 26 10:31:44 ollama[6664]: load_hparams: n_layer: 24 Sep 26 10:31:44 ollama[6664]: load_hparams: ffn_op: silu Sep 26 10:31:44 ollama[6664]: load_hparams: projection_dim: 5120 Sep 26 10:31:44 ollama[6664]: --- vision hparams --- Sep 26 10:31:44 ollama[6664]: load_hparams: image_size: 1024 Sep 26 10:31:44 ollama[6664]: load_hparams: patch_size: 14 Sep 26 10:31:44 ollama[6664]: load_hparams: has_llava_proj: 0 Sep 26 10:31:44 ollama[6664]: load_hparams: minicpmv_version: 0 Sep 26 10:31:44 ollama[6664]: load_hparams: proj_scale_factor: 0 Sep 26 10:31:44 ollama[6664]: load_hparams: n_wa_pattern: 0 Sep 26 10:31:44 ollama[6664]: load_hparams: model size: 838.51 MiB Sep 26 10:31:44 ollama[6664]: load_hparams: metadata size: 0.08 MiB Sep 26 10:31:45 ollama[6664]: alloc_compute_meta: CUDA0 compute buffer size = 3.97 MiB Sep 26 10:31:45 ollama[6664]: alloc_compute_meta: CPU compute buffer size = 0.14 MiB Sep 26 10:31:45 ollama[6664]: time=2025-09-26T10:31:45.420+03:00 level=INFO source=server.go:1289 msg="llama runner started in 33.38 seconds" Sep 26 10:31:45 ollama[6664]: time=2025-09-26T10:31:45.421+03:00 level=INFO source=sched.go:470 msg="loaded runners" count=1 Sep 26 10:31:45 ollama[6664]: time=2025-09-26T10:31:45.421+03:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding" Sep 26 10:31:45 ollama[6664]: time=2025-09-26T10:31:45.421+03:00 level=INFO source=server.go:1289 msg="llama runner started in 33.38 seconds" Sep 26 10:31:45 ollama[6664]: [GIN] 2025/09/26 - 10:31:45 | 200 | 34.347339414s | 127.0.0.1 | POST "/api/generate" Sep 26 10:32:07 ollama[6664]: [GIN] 2025/09/26 - 10:32:07 | 200 | 11.957087221s | 127.0.0.1 | POST "/api/chat" Sep 26 10:32:29 ollama[6664]: [GIN] 2025/09/26 - 10:32:29 | 200 | 15.369µs | 127.0.0.1 | HEAD "/" Sep 26 10:32:29 ollama[6664]: [GIN] 2025/09/26 - 10:32:29 | 200 | 53.48872ms | 127.0.0.1 | POST "/api/show" Sep 26 10:32:29 ollama[6664]: time=2025-09-26T10:32:29.762+03:00 level=INFO source=sched.go:537 msg="updated VRAM based on existing loaded models" gpu=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda total="23.5 GiB" available="21.9 GiB" Sep 26 10:32:29 ollama[6664]: time=2025-09-26T10:32:29.762+03:00 level=INFO source=sched.go:537 msg="updated VRAM based on existing loaded models" gpu=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda total="23.5 GiB" available="578.0 MiB" Sep 26 10:32:29 ollama[6664]: time=2025-09-26T10:32:29.762+03:00 level=INFO source=sched.go:537 msg="updated VRAM based on existing loaded models" gpu=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda total="31.7 GiB" available="2.3 GiB" Sep 26 10:32:29 ollama[6664]: llama_model_loader: loaded meta data with 44 key-value pairs and 803 tensors from /ai/llm/models/blobs/sha256-cd26a35eed550fcd10351bbaf4039d4560dae1a81895eafd4e88d938a1923745 (version GGUF V3 (latest)) Sep 26 10:32:29 ollama[6664]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 0: general.architecture str = glm4moe Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 1: general.type str = model Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 2: general.name str = GLM 4.5 Air Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 3: general.size_label str = 128x9.4B Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 4: general.license str = mit Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 5: general.tags arr[str,1] = ["text-generation"] Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 6: general.languages arr[str,2] = ["en", "zh"] Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 7: glm4moe.block_count u32 = 47 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 8: glm4moe.context_length u32 = 131072 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 9: glm4moe.embedding_length u32 = 4096 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 10: glm4moe.feed_forward_length u32 = 10944 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 11: glm4moe.attention.head_count u32 = 96 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 12: glm4moe.attention.head_count_kv u32 = 8 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 13: glm4moe.rope.freq_base f32 = 1000000.000000 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 14: glm4moe.attention.layer_norm_rms_epsilon f32 = 0.000010 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 15: glm4moe.expert_used_count u32 = 8 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 16: glm4moe.attention.key_length u32 = 128 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 17: glm4moe.attention.value_length u32 = 128 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 18: glm4moe.rope.dimension_count u32 = 64 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 19: glm4moe.expert_count u32 = 128 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 20: glm4moe.expert_feed_forward_length u32 = 1408 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 21: glm4moe.expert_shared_count u32 = 1 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 22: glm4moe.leading_dense_block_count u32 = 1 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 23: glm4moe.expert_gating_func u32 = 2 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 24: glm4moe.expert_weights_scale f32 = 1.000000 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 25: glm4moe.expert_weights_norm bool = true Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 26: glm4moe.nextn_predict_layers u32 = 1 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 28: tokenizer.ggml.pre str = glm4 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,151552] = ["!", "\"", "#", "$", "%", "&", "'", ... Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,151552] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,318088] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 
Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 151329 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 151329 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 34: tokenizer.ggml.bos_token_id u32 = 151331 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 35: tokenizer.ggml.eot_token_id u32 = 151336 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 36: tokenizer.ggml.unknown_token_id u32 = 151329 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 37: tokenizer.ggml.eom_token_id u32 = 151338 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 38: tokenizer.chat_template str = [gMASK]<sop>\n{%- if tools -%}\n<|syste... Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 39: general.quantization_version u32 = 2 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 40: general.file_type u32 = 7 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 41: split.no u16 = 0 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 42: split.tensors.count i32 = 803 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - kv 43: split.count u16 = 0 Sep 26 10:32:29 ollama[6664]: llama_model_loader: - type f32: 331 tensors Sep 26 10:32:29 ollama[6664]: llama_model_loader: - type q8_0: 472 tensors Sep 26 10:32:29 ollama[6664]: print_info: file format = GGUF V3 (latest) Sep 26 10:32:29 ollama[6664]: print_info: file type = Q8_0 Sep 26 10:32:29 ollama[6664]: print_info: file size = 109.38 GiB (8.51 BPW) Sep 26 10:32:29 ollama[6664]: load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect Sep 26 10:32:29 ollama[6664]: load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect Sep 26 10:32:29 ollama[6664]: load: printing all EOG tokens: Sep 26 10:32:29 ollama[6664]: load: - 151329 ('<|endoftext|>') Sep 26 10:32:29 ollama[6664]: load: - 151336 ('<|user|>') Sep 26 10:32:29 ollama[6664]: load: - 151338 ('<|observation|>') Sep 26 10:32:29 ollama[6664]: load: special tokens cache size = 36 Sep 26 10:32:29 ollama[6664]: load: token to piece cache size = 0.9713 MB Sep 26 10:32:29 ollama[6664]: print_info: arch = glm4moe Sep 26 10:32:29 ollama[6664]: print_info: vocab_only = 1 Sep 26 10:32:29 ollama[6664]: print_info: model type = ?B Sep 26 10:32:29 ollama[6664]: print_info: model params = 110.47 B Sep 26 10:32:29 ollama[6664]: print_info: general.name = GLM 4.5 Air Sep 26 10:32:29 ollama[6664]: print_info: vocab type = BPE Sep 26 10:32:29 ollama[6664]: print_info: n_vocab = 151552 Sep 26 10:32:29 ollama[6664]: print_info: n_merges = 318088 Sep 26 10:32:29 ollama[6664]: print_info: BOS token = 151331 '[gMASK]' Sep 26 10:32:29 ollama[6664]: print_info: EOS token = 151329 '<|endoftext|>' Sep 26 10:32:29 ollama[6664]: print_info: EOT token = 151336 '<|user|>' Sep 26 10:32:29 ollama[6664]: print_info: EOM token = 151338 '<|observation|>' Sep 26 10:32:29 ollama[6664]: print_info: UNK token = 151329 '<|endoftext|>' Sep 26 10:32:29 ollama[6664]: print_info: PAD token = 151329 '<|endoftext|>' Sep 26 10:32:29 ollama[6664]: print_info: LF token = 198 'Ċ' Sep 26 10:32:29 ollama[6664]: print_info: FIM PRE token = 151347 '<|code_prefix|>' Sep 26 10:32:29 ollama[6664]: print_info: FIM SUF token = 151349 '<|code_suffix|>' Sep 26 10:32:29 ollama[6664]: print_info: FIM MID token = 151348 '<|code_middle|>' Sep 26 10:32:29 ollama[6664]: print_info: EOG token = 151329 '<|endoftext|>' Sep 26 10:32:29 ollama[6664]: print_info: EOG token = 151336 '<|user|>' Sep 26 
10:32:29 ollama[6664]: print_info: EOG token = 151338 '<|observation|>' Sep 26 10:32:29 ollama[6664]: print_info: max token length = 1024 Sep 26 10:32:29 ollama[6664]: llama_model_load: vocab only - skipping tensors Sep 26 10:32:30 ollama[6664]: time=2025-09-26T10:32:30.311+03:00 level=INFO source=server.go:217 msg="enabling flash attention" Sep 26 10:32:30 ollama[6664]: time=2025-09-26T10:32:30.311+03:00 level=INFO source=server.go:399 msg="starting runner" cmd="/usr/local/bin/ollama runner --model /ai/llm/models/blobs/sha256-cd26a35eed550fcd10351bbaf4039d4560dae1a81895eafd4e88d938a1923745 --port 44563" Sep 26 10:32:30 ollama[6664]: time=2025-09-26T10:32:30.319+03:00 level=INFO source=runner.go:864 msg="starting go runner" Sep 26 10:32:30 ollama[6664]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so Sep 26 10:32:30 ollama[6664]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no Sep 26 10:32:30 ollama[6664]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no Sep 26 10:32:30 ollama[6664]: ggml_cuda_init: found 3 CUDA devices: Sep 26 10:32:30 ollama[6664]: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Sep 26 10:32:30 ollama[6664]: Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Sep 26 10:32:30 ollama[6664]: Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Sep 26 10:32:30 ollama[6664]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so Sep 26 10:32:30 ollama[6664]: time=2025-09-26T10:32:30.389+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) Sep 26 10:32:30 ollama[6664]: time=2025-09-26T10:32:30.390+03:00 level=INFO source=runner.go:900 msg="Server listening on 127.0.0.1:44563" Sep 26 10:32:30 ollama[6664]: time=2025-09-26T10:32:30.630+03:00 level=INFO source=server.go:504 msg="system memory" total="184.1 GiB" free="168.8 GiB" free_swap="8.0 GiB" Sep 26 10:32:31 ollama[6664]: time=2025-09-26T10:32:31.561+03:00 level=INFO source=server.go:511 msg="model requires more memory than is currently available, evicting a model to make space" estimate.library="" estimate.layers.requested=0 estimate.layers.model=0 estimate.layers.offload=0 estimate.layers.split=[] estimate.memory.available=[] estimate.memory.gpu_overhead="0 B" estimate.memory.required.full="0 B" estimate.memory.required.partial="0 B" estimate.memory.required.kv="0 B" estimate.memory.required.allocations=[] estimate.memory.weights.total="0 B" estimate.memory.weights.repeating="0 B" estimate.memory.weights.nonrepeating="0 B" estimate.memory.graph.full="0 B" estimate.memory.graph.partial="0 B" Sep 26 10:32:34 ollama[6664]: time=2025-09-26T10:32:34.525+03:00 level=INFO source=server.go:504 msg="system memory" total="184.1 GiB" free="169.6 GiB" free_swap="8.0 GiB" Sep 26 10:32:35 ollama[6664]: time=2025-09-26T10:32:35.820+03:00 level=INFO source=server.go:544 msg=offload library=cuda 
layers.requested=-1 layers.model=48 layers.offload=20 layers.split="[6 5 9]" memory.available="[23.1 GiB 23.1 GiB 31.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="136.2 GiB" memory.required.partial="73.0 GiB" memory.required.kv="3.6 GiB" memory.required.allocations="[22.3 GiB 21.2 GiB 29.5 GiB]" memory.weights.total="108.8 GiB" memory.weights.repeating="108.2 GiB" memory.weights.nonrepeating="629.0 MiB" memory.graph.full="7.2 GiB" memory.graph.partial="7.2 GiB" Sep 26 10:32:35 ollama[6664]: time=2025-09-26T10:32:35.821+03:00 level=INFO source=runner.go:799 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:20[ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:6(27..32) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:5(33..37) ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:9(38..46)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:true}" Sep 26 10:32:35 ollama[6664]: time=2025-09-26T10:32:35.821+03:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding" Sep 26 10:32:35 ollama[6664]: time=2025-09-26T10:32:35.821+03:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model" Sep 26 10:32:35 ollama[6664]: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 23620 MiB free Sep 26 10:32:35 ollama[6664]: llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) - 23683 MiB free Sep 26 10:32:36 ollama[6664]: llama_model_load_from_file_impl: using device CUDA2 (Tesla V100-SXM2-32GB) - 32109 MiB free Sep 26 10:32:36 ollama[6664]: llama_model_loader: loaded meta data with 44 key-value pairs and 803 tensors from /ai/llm/models/blobs/sha256-cd26a35eed550fcd10351bbaf4039d4560dae1a81895eafd4e88d938a1923745 (version GGUF V3 (latest)) Sep 26 10:32:36 ollama[6664]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 0: general.architecture str = glm4moe Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 1: general.type str = model Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 2: general.name str = GLM 4.5 Air Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 3: general.size_label str = 128x9.4B Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 4: general.license str = mit Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 5: general.tags arr[str,1] = ["text-generation"] Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 6: general.languages arr[str,2] = ["en", "zh"] Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 7: glm4moe.block_count u32 = 47 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 8: glm4moe.context_length u32 = 131072 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 9: glm4moe.embedding_length u32 = 4096 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 10: glm4moe.feed_forward_length u32 = 10944 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 11: glm4moe.attention.head_count u32 = 96 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 12: glm4moe.attention.head_count_kv u32 = 8 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 13: glm4moe.rope.freq_base f32 = 1000000.000000 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 14: glm4moe.attention.layer_norm_rms_epsilon f32 = 0.000010 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 15: glm4moe.expert_used_count u32 = 8 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 16: glm4moe.attention.key_length u32 = 128 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 17: glm4moe.attention.value_length u32 = 128 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 18: glm4moe.rope.dimension_count u32 = 64 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 19: glm4moe.expert_count u32 = 128 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 20: glm4moe.expert_feed_forward_length u32 = 1408 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 21: glm4moe.expert_shared_count u32 = 1 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 22: glm4moe.leading_dense_block_count u32 = 1 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 23: glm4moe.expert_gating_func u32 = 2 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 24: glm4moe.expert_weights_scale f32 = 1.000000 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 25: glm4moe.expert_weights_norm bool = true Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 26: glm4moe.nextn_predict_layers u32 = 1 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 28: tokenizer.ggml.pre str = glm4 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,151552] = ["!", "\"", "#", "$", "%", "&", "'", ... Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,151552] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,318088] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 
Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 151329 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 151329 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 34: tokenizer.ggml.bos_token_id u32 = 151331 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 35: tokenizer.ggml.eot_token_id u32 = 151336 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 36: tokenizer.ggml.unknown_token_id u32 = 151329 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 37: tokenizer.ggml.eom_token_id u32 = 151338 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 38: tokenizer.chat_template str = [gMASK]<sop>\n{%- if tools -%}\n<|syste... Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 39: general.quantization_version u32 = 2 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 40: general.file_type u32 = 7 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 41: split.no u16 = 0 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 42: split.tensors.count i32 = 803 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - kv 43: split.count u16 = 0 Sep 26 10:32:36 ollama[6664]: llama_model_loader: - type f32: 331 tensors Sep 26 10:32:36 ollama[6664]: llama_model_loader: - type q8_0: 472 tensors Sep 26 10:32:36 ollama[6664]: print_info: file format = GGUF V3 (latest) Sep 26 10:32:36 ollama[6664]: print_info: file type = Q8_0 Sep 26 10:32:36 ollama[6664]: print_info: file size = 109.38 GiB (8.51 BPW) Sep 26 10:32:36 ollama[6664]: load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect Sep 26 10:32:36 ollama[6664]: load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect Sep 26 10:32:36 ollama[6664]: load: printing all EOG tokens: Sep 26 10:32:36 ollama[6664]: load: - 151329 ('<|endoftext|>') Sep 26 10:32:36 ollama[6664]: load: - 151336 ('<|user|>') Sep 26 10:32:36 ollama[6664]: load: - 151338 ('<|observation|>') Sep 26 10:32:36 ollama[6664]: load: special tokens cache size = 36 Sep 26 10:32:36 ollama[6664]: load: token to piece cache size = 0.9713 MB Sep 26 10:32:36 ollama[6664]: print_info: arch = glm4moe Sep 26 10:32:36 ollama[6664]: print_info: vocab_only = 0 Sep 26 10:32:36 ollama[6664]: print_info: n_ctx_train = 131072 Sep 26 10:32:36 ollama[6664]: print_info: n_embd = 4096 Sep 26 10:32:36 ollama[6664]: print_info: n_layer = 47 Sep 26 10:32:36 ollama[6664]: print_info: n_head = 96 Sep 26 10:32:36 ollama[6664]: print_info: n_head_kv = 8 Sep 26 10:32:36 ollama[6664]: print_info: n_rot = 64 Sep 26 10:32:36 ollama[6664]: print_info: n_swa = 0 Sep 26 10:32:36 ollama[6664]: print_info: is_swa_any = 0 Sep 26 10:32:36 ollama[6664]: print_info: n_embd_head_k = 128 Sep 26 10:32:36 ollama[6664]: print_info: n_embd_head_v = 128 Sep 26 10:32:36 ollama[6664]: print_info: n_gqa = 12 Sep 26 10:32:36 ollama[6664]: print_info: n_embd_k_gqa = 1024 Sep 26 10:32:36 ollama[6664]: print_info: n_embd_v_gqa = 1024 Sep 26 10:32:36 ollama[6664]: print_info: f_norm_eps = 0.0e+00 Sep 26 10:32:36 ollama[6664]: print_info: f_norm_rms_eps = 1.0e-05 Sep 26 10:32:36 ollama[6664]: print_info: f_clamp_kqv = 0.0e+00 Sep 26 10:32:36 ollama[6664]: print_info: f_max_alibi_bias = 0.0e+00 Sep 26 10:32:36 ollama[6664]: print_info: f_logit_scale = 0.0e+00 Sep 26 10:32:36 ollama[6664]: print_info: f_attn_scale = 0.0e+00 Sep 26 10:32:36 ollama[6664]: print_info: n_ff = 10944 Sep 26 10:32:36 ollama[6664]: print_info: n_expert = 128 Sep 26 10:32:36 ollama[6664]: print_info: 
n_expert_used = 8 Sep 26 10:32:36 ollama[6664]: print_info: causal attn = 1 Sep 26 10:32:36 ollama[6664]: print_info: pooling type = 0 Sep 26 10:32:36 ollama[6664]: print_info: rope type = 2 Sep 26 10:32:36 ollama[6664]: print_info: rope scaling = linear Sep 26 10:32:36 ollama[6664]: print_info: freq_base_train = 1000000.0 Sep 26 10:32:36 ollama[6664]: print_info: freq_scale_train = 1 Sep 26 10:32:36 ollama[6664]: print_info: n_ctx_orig_yarn = 131072 Sep 26 10:32:36 ollama[6664]: print_info: rope_finetuned = unknown Sep 26 10:32:36 ollama[6664]: print_info: model type = 106B.A12B Sep 26 10:32:36 ollama[6664]: print_info: model params = 110.47 B Sep 26 10:32:36 ollama[6664]: print_info: general.name = GLM 4.5 Air Sep 26 10:32:36 ollama[6664]: print_info: vocab type = BPE Sep 26 10:32:36 ollama[6664]: print_info: n_vocab = 151552 Sep 26 10:32:36 ollama[6664]: print_info: n_merges = 318088 Sep 26 10:32:36 ollama[6664]: print_info: BOS token = 151331 '[gMASK]' Sep 26 10:32:36 ollama[6664]: print_info: EOS token = 151329 '<|endoftext|>' Sep 26 10:32:36 ollama[6664]: print_info: EOT token = 151336 '<|user|>' Sep 26 10:32:36 ollama[6664]: print_info: EOM token = 151338 '<|observation|>' Sep 26 10:32:36 ollama[6664]: print_info: UNK token = 151329 '<|endoftext|>' Sep 26 10:32:36 ollama[6664]: print_info: PAD token = 151329 '<|endoftext|>' Sep 26 10:32:36 ollama[6664]: print_info: LF token = 198 'Ċ' Sep 26 10:32:36 ollama[6664]: print_info: FIM PRE token = 151347 '<|code_prefix|>' Sep 26 10:32:36 ollama[6664]: print_info: FIM SUF token = 151349 '<|code_suffix|>' Sep 26 10:32:36 ollama[6664]: print_info: FIM MID token = 151348 '<|code_middle|>' Sep 26 10:32:36 ollama[6664]: print_info: EOG token = 151329 '<|endoftext|>' Sep 26 10:32:36 ollama[6664]: print_info: EOG token = 151336 '<|user|>' Sep 26 10:32:36 ollama[6664]: print_info: EOG token = 151338 '<|observation|>' Sep 26 10:32:36 ollama[6664]: print_info: max token length = 1024 Sep 26 10:32:36 ollama[6664]: load_tensors: loading model tensors, this can take a while... 
(mmap = true) Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.attn_norm.weight (size = 16384 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.attn_q.weight (size = 53477376 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.attn_k.weight (size = 4456448 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.attn_v.weight (size = 4456448 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.attn_q.bias (size = 49152 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.attn_k.bias (size = 4096 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.attn_v.bias (size = 4096 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.attn_output.weight (size = 53477376 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.post_attention_norm.weight (size = 16384 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.ffn_gate_inp.weight (size = 2097152 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.exp_probs_b.bias (size = 512 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.ffn_gate_exps.weight (size = 784334848 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.ffn_down_exps.weight (size = 784334848 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.ffn_up_exps.weight (size = 784334848 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.ffn_gate_shexp.weight (size = 6127616 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.ffn_down_shexp.weight (size = 6127616 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.ffn_up_shexp.weight (size = 6127616 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.nextn.eh_proj.weight (size = 35651584 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.nextn.embed_tokens.weight (size = 659554304 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.nextn.enorm.weight (size = 16384 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.nextn.hnorm.weight (size = 16384 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.nextn.shared_head_head.weight (size = 659554304 bytes) -- ignoring Sep 26 10:32:36 ollama[6664]: model has unused tensor blk.46.nextn.shared_head_norm.weight (size = 16384 bytes) -- ignoring Sep 26 10:33:28 ollama[6664]: load_tensors: offloading 20 repeating layers to GPU Sep 26 10:33:28 ollama[6664]: load_tensors: offloaded 20/48 layers to GPU Sep 26 10:33:28 ollama[6664]: load_tensors: CUDA0 model buffer size = 14244.71 MiB Sep 26 10:33:28 ollama[6664]: load_tensors: CUDA1 model buffer size = 11870.59 MiB Sep 26 10:33:28 ollama[6664]: load_tensors: CUDA2 model buffer size = 18992.95 MiB Sep 26 10:33:28 ollama[6664]: load_tensors: CPU_Mapped model buffer size = 63231.93 MiB Sep 26 10:33:41 ollama[6664]: llama_context: constructing llama_context Sep 26 10:33:41 ollama[6664]: llama_context: n_seq_max = 1 Sep 26 10:33:41 ollama[6664]: llama_context: n_ctx = 20000 Sep 26 10:33:41 ollama[6664]: llama_context: n_ctx_per_seq = 20000 Sep 26 10:33:41 ollama[6664]: llama_context: n_batch = 512 Sep 26 10:33:41 ollama[6664]: llama_context: n_ubatch = 512 Sep 26 10:33:41 
ollama[6664]: llama_context: causal_attn = 1 Sep 26 10:33:41 ollama[6664]: llama_context: flash_attn = 1 Sep 26 10:33:41 ollama[6664]: llama_context: kv_unified = false Sep 26 10:33:41 ollama[6664]: llama_context: freq_base = 1000000.0 Sep 26 10:33:41 ollama[6664]: llama_context: freq_scale = 1 Sep 26 10:33:41 ollama[6664]: llama_context: n_ctx_per_seq (20000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized Sep 26 10:33:41 ollama[6664]: llama_context: CPU output buffer size = 0.59 MiB Sep 26 10:33:41 ollama[6664]: llama_kv_cache_unified: CUDA0 KV buffer size = 474.00 MiB Sep 26 10:33:42 ollama[6664]: llama_kv_cache_unified: CUDA1 KV buffer size = 395.00 MiB Sep 26 10:33:42 ollama[6664]: llama_kv_cache_unified: CUDA2 KV buffer size = 632.00 MiB Sep 26 10:33:42 ollama[6664]: llama_kv_cache_unified: CPU KV buffer size = 2133.00 MiB Sep 26 10:33:42 ollama[6664]: llama_kv_cache_unified: size = 3634.00 MiB ( 20224 cells, 46 layers, 1/1 seqs), K (f16): 1817.00 MiB, V (f16): 1817.00 MiB Sep 26 10:33:42 ollama[6664]: llama_context: CUDA0 compute buffer size = 933.00 MiB Sep 26 10:33:42 ollama[6664]: llama_context: CUDA1 compute buffer size = 166.51 MiB Sep 26 10:33:42 ollama[6664]: llama_context: CUDA2 compute buffer size = 166.51 MiB Sep 26 10:33:42 ollama[6664]: llama_context: CUDA_Host compute buffer size = 47.51 MiB Sep 26 10:33:42 ollama[6664]: llama_context: graph nodes = 3101 Sep 26 10:33:42 ollama[6664]: llama_context: graph splits = 514 (with bs=512), 5 (with bs=1) Sep 26 10:33:42 ollama[6664]: time=2025-09-26T10:33:42.749+03:00 level=INFO source=server.go:1289 msg="llama runner started in 72.44 seconds" Sep 26 10:33:42 ollama[6664]: time=2025-09-26T10:33:42.749+03:00 level=INFO source=sched.go:470 msg="loaded runners" count=1 Sep 26 10:33:42 ollama[6664]: time=2025-09-26T10:33:42.749+03:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding" Sep 26 10:33:42 ollama[6664]: time=2025-09-26T10:33:42.749+03:00 level=INFO source=server.go:1289 msg="llama runner started in 72.44 seconds" Sep 26 10:33:42 ollama[6664]: [GIN] 2025/09/26 - 10:33:42 | 200 | 1m13s | 127.0.0.1 | POST "/api/generate" Sep 26 10:34:23 ollama[6664]: [GIN] 2025/09/26 - 10:34:23 | 200 | 35.476925376s | 127.0.0.1 | POST "/api/chat" Sep 26 10:34:50 ollama[6664]: [GIN] 2025/09/26 - 10:34:50 | 200 | 28.123µs | 127.0.0.1 | HEAD "/" Sep 26 10:34:51 ollama[6664]: [GIN] 2025/09/26 - 10:34:51 | 200 | 44.412727ms | 127.0.0.1 | POST "/api/show" Sep 26 10:34:51 ollama[6664]: time=2025-09-26T10:34:51.438+03:00 level=INFO source=sched.go:537 msg="updated VRAM based on existing loaded models" gpu=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda total="23.5 GiB" available="1.2 GiB" Sep 26 10:34:51 ollama[6664]: time=2025-09-26T10:34:51.438+03:00 level=INFO source=sched.go:537 msg="updated VRAM based on existing loaded models" gpu=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda total="23.5 GiB" available="2.3 GiB" Sep 26 10:34:51 ollama[6664]: time=2025-09-26T10:34:51.438+03:00 level=INFO source=sched.go:537 msg="updated VRAM based on existing loaded models" gpu=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda total="31.7 GiB" available="2.2 GiB" Sep 26 10:34:51 ollama[6664]: llama_model_loader: loaded meta data with 40 key-value pairs and 569 tensors from /ai/llm/models/blobs/sha256-b8517e4413faf9d11cd5bd85e08a5fcf77c29db0d03318401c9eff6063c87e84 (version GGUF V3 (latest)) Sep 26 10:34:51 ollama[6664]: llama_model_loader: Dumping metadata keys/values. 
Note: KV overrides do not apply in this output. Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 0: general.architecture str = deci Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 1: general.type str = model Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 2: general.name str = Llama_Nemotron_Super_V1_5 Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 3: general.finetune str = 3_3-Nemotron-Super-v1_5 Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 4: general.basename str = Llama Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 5: general.size_label str = 49B Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 6: general.license str = other Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 7: general.license.name str = nvidia-open-model-license Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 8: general.license.link str = https://www.nvidia.com/en-us/agreemen... Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 9: general.tags arr[str,4] = ["nvidia", "llama-3", "pytorch", "tex... Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 10: general.languages arr[str,1] = ["en"] Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 11: deci.rope.freq_base f32 = 500000.000000 Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 12: deci.attention.head_count_kv arr[i32,80] = [8, 8, 8, 8, 8, 8, 0, 0, 8, 8, 8, 0, ... Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 13: deci.attention.head_count arr[i32,80] = [64, 64, 64, 64, 64, 64, 0, 0, 64, 64... Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 14: deci.feed_forward_length arr[i32,80] = [14336, 28672, 28672, 28672, 28672, 2... Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 15: deci.block_count u32 = 80 Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 16: deci.context_length u32 = 131072 Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 17: deci.embedding_length u32 = 8192 Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 18: deci.attention.layer_norm_rms_epsilon f32 = 0.000010 Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 19: deci.attention.key_length u32 = 128 Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 20: deci.attention.value_length u32 = 128 Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 21: deci.vocab_size u32 = 128256 Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 22: deci.rope.dimension_count u32 = 128 Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2 Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 24: tokenizer.ggml.pre str = llama-bpe Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... 
Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 128000 Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 128009 Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 128009 Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = true Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 32: tokenizer.ggml.add_sep_token bool = false Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 33: tokenizer.chat_template str = {% set bos = "<|begin_of_text|>" %}{%... Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 34: general.quantization_version u32 = 2 Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 35: general.file_type u32 = 7 Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 36: quantize.imatrix.file str = Llama-3_3-Nemotron-Super-49B-v1_5/Lla... Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 37: quantize.imatrix.dataset str = calibration_datav3.txt Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 38: quantize.imatrix.entries_count u32 = 436 Sep 26 10:34:51 ollama[6664]: llama_model_loader: - kv 39: quantize.imatrix.chunks_count u32 = 125 Sep 26 10:34:51 ollama[6664]: llama_model_loader: - type f32: 131 tensors Sep 26 10:34:51 ollama[6664]: llama_model_loader: - type q8_0: 438 tensors Sep 26 10:34:51 ollama[6664]: print_info: file format = GGUF V3 (latest) Sep 26 10:34:51 ollama[6664]: print_info: file type = Q8_0 Sep 26 10:34:51 ollama[6664]: print_info: file size = 49.35 GiB (8.50 BPW) Sep 26 10:34:51 ollama[6664]: load: printing all EOG tokens: Sep 26 10:34:51 ollama[6664]: load: - 128001 ('<|end_of_text|>') Sep 26 10:34:51 ollama[6664]: load: - 128008 ('<|eom_id|>') Sep 26 10:34:51 ollama[6664]: load: - 128009 ('<|eot_id|>') Sep 26 10:34:51 ollama[6664]: load: special tokens cache size = 256 Sep 26 10:34:51 ollama[6664]: load: token to piece cache size = 0.7999 MB Sep 26 10:34:51 ollama[6664]: print_info: arch = deci Sep 26 10:34:51 ollama[6664]: print_info: vocab_only = 1 Sep 26 10:34:51 ollama[6664]: print_info: model type = ?B Sep 26 10:34:51 ollama[6664]: print_info: model params = 49.87 B Sep 26 10:34:51 ollama[6664]: print_info: general.name = Llama_Nemotron_Super_V1_5 Sep 26 10:34:51 ollama[6664]: print_info: vocab type = BPE Sep 26 10:34:51 ollama[6664]: print_info: n_vocab = 128256 Sep 26 10:34:51 ollama[6664]: print_info: n_merges = 280147 Sep 26 10:34:51 ollama[6664]: print_info: BOS token = 128000 '<|begin_of_text|>' Sep 26 10:34:51 ollama[6664]: print_info: EOS token = 128009 '<|eot_id|>' Sep 26 10:34:51 ollama[6664]: print_info: EOT token = 128009 '<|eot_id|>' Sep 26 10:34:51 ollama[6664]: print_info: EOM token = 128008 '<|eom_id|>' Sep 26 10:34:51 ollama[6664]: print_info: PAD token = 128009 '<|eot_id|>' Sep 26 10:34:51 ollama[6664]: print_info: LF token = 198 'Ċ' Sep 26 10:34:51 ollama[6664]: print_info: EOG token = 128001 '<|end_of_text|>' Sep 26 10:34:51 ollama[6664]: print_info: EOG token = 128008 '<|eom_id|>' Sep 26 10:34:51 ollama[6664]: print_info: EOG token = 128009 '<|eot_id|>' Sep 26 10:34:51 ollama[6664]: print_info: max token length = 256 Sep 26 10:34:51 ollama[6664]: llama_model_load: vocab only - skipping tensors Sep 26 10:34:51 ollama[6664]: time=2025-09-26T10:34:51.971+03:00 level=INFO source=server.go:217 msg="enabling flash attention" Sep 26 10:34:51 ollama[6664]: time=2025-09-26T10:34:51.971+03:00 level=INFO source=server.go:399 msg="starting runner" 
cmd="/usr/local/bin/ollama runner --model /ai/llm/models/blobs/sha256-b8517e4413faf9d11cd5bd85e08a5fcf77c29db0d03318401c9eff6063c87e84 --port 39567" Sep 26 10:34:51 ollama[6664]: time=2025-09-26T10:34:51.979+03:00 level=INFO source=runner.go:864 msg="starting go runner" Sep 26 10:34:51 ollama[6664]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so Sep 26 10:34:52 ollama[6664]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no Sep 26 10:34:52 ollama[6664]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no Sep 26 10:34:52 ollama[6664]: ggml_cuda_init: found 3 CUDA devices: Sep 26 10:34:52 ollama[6664]: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Sep 26 10:34:52 ollama[6664]: Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Sep 26 10:34:52 ollama[6664]: Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Sep 26 10:34:52 ollama[6664]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so Sep 26 10:34:52 ollama[6664]: time=2025-09-26T10:34:52.073+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) Sep 26 10:34:52 ollama[6664]: time=2025-09-26T10:34:52.073+03:00 level=INFO source=runner.go:900 msg="Server listening on 127.0.0.1:39567" Sep 26 10:34:52 ollama[6664]: time=2025-09-26T10:34:52.297+03:00 level=INFO source=server.go:504 msg="system memory" total="184.1 GiB" free="166.7 GiB" free_swap="8.0 GiB" Sep 26 10:34:53 ollama[6664]: time=2025-09-26T10:34:53.236+03:00 level=INFO source=server.go:511 msg="model requires more memory than is currently available, evicting a model to make space" estimate.library="" estimate.layers.requested=0 estimate.layers.model=0 estimate.layers.offload=0 estimate.layers.split=[] estimate.memory.available=[] estimate.memory.gpu_overhead="0 B" estimate.memory.required.full="0 B" estimate.memory.required.partial="0 B" estimate.memory.required.kv="0 B" estimate.memory.required.allocations=[] estimate.memory.weights.total="0 B" estimate.memory.weights.repeating="0 B" estimate.memory.weights.nonrepeating="0 B" estimate.memory.graph.full="0 B" estimate.memory.graph.partial="0 B" Sep 26 10:34:58 ollama[6664]: time=2025-09-26T10:34:58.431+03:00 level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.194652105 runner.size="136.2 GiB" runner.vram="73.0 GiB" runner.parallel=1 runner.pid=7079 runner.model=/ai/llm/models/blobs/sha256-cd26a35eed550fcd10351bbaf4039d4560dae1a81895eafd4e88d938a1923745 Sep 26 10:34:58 ollama[6664]: time=2025-09-26T10:34:58.763+03:00 level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.526756847 runner.size="136.2 GiB" runner.vram="73.0 GiB" runner.parallel=1 runner.pid=7079 runner.model=/ai/llm/models/blobs/sha256-cd26a35eed550fcd10351bbaf4039d4560dae1a81895eafd4e88d938a1923745 Sep 26 10:34:59 ollama[6664]: 
time=2025-09-26T10:34:59.431+03:00 level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=6.19473457 runner.size="136.2 GiB" runner.vram="73.0 GiB" runner.parallel=1 runner.pid=7079 runner.model=/ai/llm/models/blobs/sha256-cd26a35eed550fcd10351bbaf4039d4560dae1a81895eafd4e88d938a1923745 Sep 26 10:34:59 ollama[6664]: time=2025-09-26T10:34:59.763+03:00 level=INFO source=server.go:504 msg="system memory" total="184.1 GiB" free="169.5 GiB" free_swap="8.0 GiB" Sep 26 10:35:01 ollama[6664]: time=2025-09-26T10:35:01.083+03:00 level=INFO source=server.go:544 msg=offload library=cuda layers.requested=-1 layers.model=81 layers.offload=0 layers.split=[] memory.available="[23.1 GiB 23.1 GiB 31.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="52.0 GiB" memory.required.partial="0 B" memory.required.kv="3.7 GiB" memory.required.allocations="[0 B 0 B 0 B]" memory.weights.total="48.3 GiB" memory.weights.repeating="47.3 GiB" memory.weights.nonrepeating="1.0 GiB" memory.graph.full="39.9 GiB" memory.graph.partial="39.9 GiB" Sep 26 10:35:01 ollama[6664]: time=2025-09-26T10:35:01.084+03:00 level=INFO source=runner.go:799 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" Sep 26 10:35:01 ollama[6664]: time=2025-09-26T10:35:01.084+03:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding" Sep 26 10:35:01 ollama[6664]: time=2025-09-26T10:35:01.084+03:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model" Sep 26 10:35:01 ollama[6664]: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 23624 MiB free Sep 26 10:35:01 ollama[6664]: llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) - 23667 MiB free Sep 26 10:35:01 ollama[6664]: llama_model_load_from_file_impl: using device CUDA2 (Tesla V100-SXM2-32GB) - 32153 MiB free Sep 26 10:35:01 ollama[6664]: llama_model_loader: loaded meta data with 40 key-value pairs and 569 tensors from /ai/llm/models/blobs/sha256-b8517e4413faf9d11cd5bd85e08a5fcf77c29db0d03318401c9eff6063c87e84 (version GGUF V3 (latest)) Sep 26 10:35:01 ollama[6664]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 0: general.architecture str = deci Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 1: general.type str = model Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 2: general.name str = Llama_Nemotron_Super_V1_5 Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 3: general.finetune str = 3_3-Nemotron-Super-v1_5 Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 4: general.basename str = Llama Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 5: general.size_label str = 49B Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 6: general.license str = other Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 7: general.license.name str = nvidia-open-model-license Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 8: general.license.link str = https://www.nvidia.com/en-us/agreemen... Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 9: general.tags arr[str,4] = ["nvidia", "llama-3", "pytorch", "tex... 
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 10: general.languages arr[str,1] = ["en"] Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 11: deci.rope.freq_base f32 = 500000.000000 Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 12: deci.attention.head_count_kv arr[i32,80] = [8, 8, 8, 8, 8, 8, 0, 0, 8, 8, 8, 0, ... Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 13: deci.attention.head_count arr[i32,80] = [64, 64, 64, 64, 64, 64, 0, 0, 64, 64... Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 14: deci.feed_forward_length arr[i32,80] = [14336, 28672, 28672, 28672, 28672, 2... Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 15: deci.block_count u32 = 80 Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 16: deci.context_length u32 = 131072 Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 17: deci.embedding_length u32 = 8192 Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 18: deci.attention.layer_norm_rms_epsilon f32 = 0.000010 Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 19: deci.attention.key_length u32 = 128 Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 20: deci.attention.value_length u32 = 128 Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 21: deci.vocab_size u32 = 128256 Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 22: deci.rope.dimension_count u32 = 128 Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2 Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 24: tokenizer.ggml.pre str = llama-bpe Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 128000 Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 128009 Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 128009 Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = true Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 32: tokenizer.ggml.add_sep_token bool = false Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 33: tokenizer.chat_template str = {% set bos = "<|begin_of_text|>" %}{%... Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 34: general.quantization_version u32 = 2 Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 35: general.file_type u32 = 7 Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 36: quantize.imatrix.file str = Llama-3_3-Nemotron-Super-49B-v1_5/Lla... 
Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 37: quantize.imatrix.dataset str = calibration_datav3.txt Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 38: quantize.imatrix.entries_count u32 = 436 Sep 26 10:35:01 ollama[6664]: llama_model_loader: - kv 39: quantize.imatrix.chunks_count u32 = 125 Sep 26 10:35:01 ollama[6664]: llama_model_loader: - type f32: 131 tensors Sep 26 10:35:01 ollama[6664]: llama_model_loader: - type q8_0: 438 tensors Sep 26 10:35:01 ollama[6664]: print_info: file format = GGUF V3 (latest) Sep 26 10:35:01 ollama[6664]: print_info: file type = Q8_0 Sep 26 10:35:01 ollama[6664]: print_info: file size = 49.35 GiB (8.50 BPW) Sep 26 10:35:01 ollama[6664]: load: printing all EOG tokens: Sep 26 10:35:01 ollama[6664]: load: - 128001 ('<|end_of_text|>') Sep 26 10:35:01 ollama[6664]: load: - 128008 ('<|eom_id|>') Sep 26 10:35:01 ollama[6664]: load: - 128009 ('<|eot_id|>') Sep 26 10:35:01 ollama[6664]: load: special tokens cache size = 256 Sep 26 10:35:01 ollama[6664]: load: token to piece cache size = 0.7999 MB Sep 26 10:35:01 ollama[6664]: print_info: arch = deci Sep 26 10:35:01 ollama[6664]: print_info: vocab_only = 0 Sep 26 10:35:01 ollama[6664]: print_info: n_ctx_train = 131072 Sep 26 10:35:01 ollama[6664]: print_info: n_embd = 8192 Sep 26 10:35:01 ollama[6664]: print_info: n_layer = 80 Sep 26 10:35:01 ollama[6664]: print_info: n_head = [64, 64, 64, 64, 64, 64, 0, 0, 64, 64, 64, 0, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 64, 64, 64, 64, 64, 64, 64, 64, 64] Sep 26 10:35:01 ollama[6664]: print_info: n_head_kv = [8, 8, 8, 8, 8, 8, 0, 0, 8, 8, 8, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8] Sep 26 10:35:01 ollama[6664]: print_info: n_rot = 128 Sep 26 10:35:01 ollama[6664]: print_info: n_swa = 0 Sep 26 10:35:01 ollama[6664]: print_info: is_swa_any = 0 Sep 26 10:35:01 ollama[6664]: print_info: n_embd_head_k = 128 Sep 26 10:35:01 ollama[6664]: print_info: n_embd_head_v = 128 Sep 26 10:35:01 ollama[6664]: print_info: n_gqa = [8, 8, 8, 8, 8, 8, 0, 0, 8, 8, 8, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8, 8] Sep 26 10:35:01 ollama[6664]: print_info: n_embd_k_gqa = [1024, 1024, 1024, 1024, 1024, 1024, 0, 0, 1024, 1024, 1024, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024] Sep 26 10:35:01 ollama[6664]: print_info: n_embd_v_gqa = [1024, 1024, 1024, 1024, 1024, 1024, 0, 0, 1024, 1024, 1024, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024] Sep 26 10:35:01 ollama[6664]: print_info: f_norm_eps = 0.0e+00 Sep 26 10:35:01 ollama[6664]: print_info: 
f_norm_rms_eps = 1.0e-05 Sep 26 10:35:01 ollama[6664]: print_info: f_clamp_kqv = 0.0e+00 Sep 26 10:35:01 ollama[6664]: print_info: f_max_alibi_bias = 0.0e+00 Sep 26 10:35:01 ollama[6664]: print_info: f_logit_scale = 0.0e+00 Sep 26 10:35:01 ollama[6664]: print_info: f_attn_scale = 0.0e+00 Sep 26 10:35:01 ollama[6664]: print_info: n_ff = [14336, 28672, 28672, 28672, 28672, 28672, 14336, 14336, 28672, 28672, 28672, 17920, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 7168, 14336, 14336, 7168, 28672, 7168, 14336, 7168, 7168, 7168, 28672, 7168, 5632, 5632, 7168, 5632, 5632, 5632, 7168, 7168, 2816, 2816, 5632, 5632, 2816, 2816, 5632, 2816, 2816, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672, 28672] Sep 26 10:35:01 ollama[6664]: print_info: n_expert = 0 Sep 26 10:35:01 ollama[6664]: print_info: n_expert_used = 0 Sep 26 10:35:01 ollama[6664]: print_info: causal attn = 1 Sep 26 10:35:01 ollama[6664]: print_info: pooling type = 0 Sep 26 10:35:01 ollama[6664]: print_info: rope type = 0 Sep 26 10:35:01 ollama[6664]: print_info: rope scaling = linear Sep 26 10:35:01 ollama[6664]: print_info: freq_base_train = 500000.0 Sep 26 10:35:01 ollama[6664]: print_info: freq_scale_train = 1 Sep 26 10:35:01 ollama[6664]: print_info: n_ctx_orig_yarn = 131072 Sep 26 10:35:01 ollama[6664]: print_info: rope_finetuned = unknown Sep 26 10:35:01 ollama[6664]: print_info: model type = 70B Sep 26 10:35:01 ollama[6664]: print_info: model params = 49.87 B Sep 26 10:35:01 ollama[6664]: print_info: general.name = Llama_Nemotron_Super_V1_5 Sep 26 10:35:01 ollama[6664]: print_info: vocab type = BPE Sep 26 10:35:01 ollama[6664]: print_info: n_vocab = 128256 Sep 26 10:35:01 ollama[6664]: print_info: n_merges = 280147 Sep 26 10:35:01 ollama[6664]: print_info: BOS token = 128000 '<|begin_of_text|>' Sep 26 10:35:01 ollama[6664]: print_info: EOS token = 128009 '<|eot_id|>' Sep 26 10:35:01 ollama[6664]: print_info: EOT token = 128009 '<|eot_id|>' Sep 26 10:35:01 ollama[6664]: print_info: EOM token = 128008 '<|eom_id|>' Sep 26 10:35:01 ollama[6664]: print_info: PAD token = 128009 '<|eot_id|>' Sep 26 10:35:01 ollama[6664]: print_info: LF token = 198 'Ċ' Sep 26 10:35:01 ollama[6664]: print_info: EOG token = 128001 '<|end_of_text|>' Sep 26 10:35:01 ollama[6664]: print_info: EOG token = 128008 '<|eom_id|>' Sep 26 10:35:01 ollama[6664]: print_info: EOG token = 128009 '<|eot_id|>' Sep 26 10:35:01 ollama[6664]: print_info: max token length = 256 Sep 26 10:35:01 ollama[6664]: load_tensors: loading model tensors, this can take a while... 
(mmap = false) Sep 26 10:35:23 ollama[6664]: load_tensors: offloading 0 repeating layers to GPU Sep 26 10:35:23 ollama[6664]: load_tensors: offloaded 0/81 layers to GPU Sep 26 10:35:23 ollama[6664]: load_tensors: CUDA_Host model buffer size = 50532.31 MiB Sep 26 10:35:59 ollama[6664]: llama_context: constructing llama_context Sep 26 10:35:59 ollama[6664]: llama_context: n_seq_max = 1 Sep 26 10:35:59 ollama[6664]: llama_context: n_ctx = 20000 Sep 26 10:35:59 ollama[6664]: llama_context: n_ctx_per_seq = 20000 Sep 26 10:35:59 ollama[6664]: llama_context: n_batch = 512 Sep 26 10:35:59 ollama[6664]: llama_context: n_ubatch = 512 Sep 26 10:35:59 ollama[6664]: llama_context: causal_attn = 1 Sep 26 10:35:59 ollama[6664]: llama_context: flash_attn = 1 Sep 26 10:35:59 ollama[6664]: llama_context: kv_unified = false Sep 26 10:35:59 ollama[6664]: llama_context: freq_base = 500000.0 Sep 26 10:35:59 ollama[6664]: llama_context: freq_scale = 1 Sep 26 10:35:59 ollama[6664]: llama_context: n_ctx_per_seq (20000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized Sep 26 10:35:59 ollama[6664]: llama_context: CPU output buffer size = 0.52 MiB Sep 26 10:35:59 ollama[6664]: llama_kv_cache_unified: CPU KV buffer size = 3871.00 MiB Sep 26 10:36:00 ollama[6664]: llama_kv_cache_unified: size = 3871.00 MiB ( 20224 cells, 80 layers, 1/1 seqs), K (f16): 1935.50 MiB, V (f16): 1935.50 MiB Sep 26 10:36:00 ollama[6664]: llama_context: CUDA0 compute buffer size = 1331.12 MiB Sep 26 10:36:00 ollama[6664]: llama_context: CUDA_Host compute buffer size = 55.51 MiB Sep 26 10:36:00 ollama[6664]: llama_context: graph nodes = 1743 Sep 26 10:36:00 ollama[6664]: llama_context: graph splits = 668 (with bs=512), 1 (with bs=1) Sep 26 10:36:00 ollama[6664]: time=2025-09-26T10:36:00.546+03:00 level=INFO source=server.go:1289 msg="llama runner started in 68.57 seconds" Sep 26 10:36:00 ollama[6664]: time=2025-09-26T10:36:00.547+03:00 level=INFO source=sched.go:470 msg="loaded runners" count=1 Sep 26 10:36:00 ollama[6664]: time=2025-09-26T10:36:00.547+03:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding" Sep 26 10:36:00 ollama[6664]: time=2025-09-26T10:36:00.548+03:00 level=INFO source=server.go:1289 msg="llama runner started in 68.58 seconds" Sep 26 10:36:00 ollama[6664]: [GIN] 2025/09/26 - 10:36:00 | 200 | 1m9s | 127.0.0.1 | POST "/api/generate" Sep 26 10:41:19 ollama[6664]: [GIN] 2025/09/26 - 10:41:19 | 200 | 5m17s | 127.0.0.1 | POST "/api/chat" Sep 26 10:43:06 ollama[6664]: [GIN] 2025/09/26 - 10:43:06 | 200 | 20.489µs | 127.0.0.1 | HEAD "/" Sep 26 10:43:07 ollama[6664]: [GIN] 2025/09/26 - 10:43:07 | 200 | 43.440654ms | 127.0.0.1 | POST "/api/show" Sep 26 10:43:07 ollama[6664]: time=2025-09-26T10:43:07.456+03:00 level=INFO source=sched.go:537 msg="updated VRAM based on existing loaded models" gpu=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=cuda total="23.5 GiB" available="22.6 GiB" Sep 26 10:43:07 ollama[6664]: time=2025-09-26T10:43:07.456+03:00 level=INFO source=sched.go:537 msg="updated VRAM based on existing loaded models" gpu=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=cuda total="23.5 GiB" available="22.6 GiB" Sep 26 10:43:07 ollama[6664]: time=2025-09-26T10:43:07.456+03:00 level=INFO source=sched.go:537 msg="updated VRAM based on existing loaded models" gpu=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=cuda total="31.7 GiB" available="31.0 GiB" Sep 26 10:43:07 ollama[6664]: time=2025-09-26T10:43:07.821+03:00 level=INFO source=server.go:217 msg="enabling 
flash attention" Sep 26 10:43:07 ollama[6664]: time=2025-09-26T10:43:07.828+03:00 level=INFO source=server.go:399 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /ai/llm/models/blobs/sha256-0a79b00d4d08bda6eccd6b1d0317defe8e1c0ece06b6fc4745ec17a7323d5899 --port 35961" Sep 26 10:43:07 ollama[6664]: time=2025-09-26T10:43:07.828+03:00 level=INFO source=server.go:672 msg="loading model" "model layers"=49 requested=-1 Sep 26 10:43:07 ollama[6664]: time=2025-09-26T10:43:07.835+03:00 level=INFO source=runner.go:1252 msg="starting ollama engine" Sep 26 10:43:07 ollama[6664]: time=2025-09-26T10:43:07.836+03:00 level=INFO source=runner.go:1287 msg="Server listening on 127.0.0.1:35961" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.141+03:00 level=INFO source=server.go:678 msg="system memory" total="184.1 GiB" free="121.2 GiB" free_swap="1.4 GiB" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.141+03:00 level=INFO source=server.go:686 msg="gpu memory" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe available="22.1 GiB" free="22.6 GiB" minimum="457.0 MiB" overhead="0 B" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.141+03:00 level=INFO source=server.go:686 msg="gpu memory" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 available="22.2 GiB" free="22.6 GiB" minimum="457.0 MiB" overhead="0 B" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.141+03:00 level=INFO source=server.go:686 msg="gpu memory" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 available="30.6 GiB" free="31.0 GiB" minimum="457.0 MiB" overhead="0 B" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.142+03:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:49[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:49(0..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.163+03:00 level=INFO source=ggml.go:131 msg="" architecture=qwen3moe file_type=Q8_0 name="Qwen3 30B A3B Thinking 2507" description="" num_tensors=579 num_key_values=33 Sep 26 10:43:08 ollama[6664]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so Sep 26 10:43:08 ollama[6664]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no Sep 26 10:43:08 ollama[6664]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no Sep 26 10:43:08 ollama[6664]: ggml_cuda_init: found 3 CUDA devices: Sep 26 10:43:08 ollama[6664]: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Sep 26 10:43:08 ollama[6664]: Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Sep 26 10:43:08 ollama[6664]: Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Sep 26 10:43:08 ollama[6664]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.247+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 
CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.291+03:00 level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:49[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:28(0..27) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:21(28..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.391+03:00 level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:49[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:28(0..27) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:21(28..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=runner.go:1171 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:20000 KvCacheType: NumThreads:16 GPULayers:49[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:28(0..27) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:21(28..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=ggml.go:487 msg="offloading 48 repeating layers to GPU" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=ggml.go:493 msg="offloading output layer to GPU" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=ggml.go:498 msg="offloaded 49/49 layers to GPU" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="12.7 GiB" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="17.3 GiB" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="315.3 MiB" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="790.0 MiB" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="1.1 GiB" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="96.3 MiB" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="103.8 MiB" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="4.0 MiB" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=backend.go:342 msg="total memory" size="32.3 GiB" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=sched.go:470 msg="loaded runners" count=2 Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=server.go:1251 msg="waiting for llama runner to start responding" Sep 26 10:43:08 ollama[6664]: time=2025-09-26T10:43:08.648+03:00 level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading 
model" Sep 26 10:43:21 ollama[6664]: time=2025-09-26T10:43:21.177+03:00 level=INFO source=server.go:1289 msg="llama runner started in 13.35 seconds" Sep 26 10:43:21 ollama[6664]: [GIN] 2025/09/26 - 10:43:21 | 200 | 14.148875755s | 127.0.0.1 | POST "/api/generate" Sep 26 10:43:43 ollama[6664]: [GIN] 2025/09/26 - 10:43:43 | 200 | 12.909124255s | 127.0.0.1 | POST "/api/chat" ``` **TL:DR** Magistral-Small-2509-GGUF:BF16: **PASS** GLM-4.5-Air-Q8_0:latest: **FAIL** Llama-3_3-Nemotron-Super-49B-v1_5-GGUF:Q8_0: **PASS** qwen3:30b-a3b-thinking-2507-q8_0: **PASS** gpt-oss: **FAIL**

@Goreg12345 commented on GitHub (Sep 27, 2025):

I'm experiencing the same problem with Ollama 0.12.3 on an NVIDIA V100 32GB using gpt-oss:20b. The model answers unrelated topics, and the output is corrupted: still natural language, but strange, and the thinking tokens aren't set properly.

Turning off flash attention didn't help and actually made it worse: now the model only answers "Ok." or produces other corrupt output.

I tried gemma3:12b and llama3.2:3b and they both work fine.
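
For anyone repeating the flash-attention experiment, a sketch of one way to toggle it off, assuming a systemd-managed Linux install and that this build honors the OLLAMA_FLASH_ATTENTION environment variable:

```
# Add an override for the ollama service (assumes a systemd install).
sudo systemctl edit ollama
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=0"
# then restart the service and retry the model.
sudo systemctl restart ollama
ollama run gpt-oss:20b "Hi, introduce yourself"
```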


@jessegross commented on GitHub (Oct 7, 2025):

For those that are running into this issue, please test out one of the 0.12.4 RCs and let us know if it fixes the issue with V100s - we have updated the kernels.
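
A sketch of one way to pull a specific release candidate on Linux, assuming the official install script's OLLAMA_VERSION override accepts pre-release tags:

```
# Install a specific RC build via the official install script (Linux).
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.12.4-rc6 sh
ollama -v   # should now report the RC version
```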


@ka-admin commented on GitHub (Oct 8, 2025):

> For those that are running into this issue, please test out one of the 0.12.4 RCs and let us know if it fixes the issue with V100s - we have updated the kernels.

ollama version is 0.12.4-rc6
:/usr/local/bin# ollama run gpt-oss:120b
>>> hello
Thinking...
We need to respond as ChatGPT. The user says "hello". So respond friendly.
...done thinking.

Hello! How can I help you today?

>>> introduce yourself
Thinking...
We need to introduce ourselves. Should be friendly, mention capabilities, etc. Also mention we are ChatGPT, large language model, trained by OpenAI, knowledge cutoff 2024-06, date 2025-10-08, can assist with many
topics. Also perhaps ask user what they need.
...done thinking.

Hi there! I’m ChatGPT, a large‑language model created by OpenAI. I’m designed to understand and generate natural‑language text, so I can help you with a wide range of tasks—answering questions, brainstorming
ideas, explaining concepts, drafting emails or stories, troubleshooting code, learning new topics, practicing languages, planning trips, and much more.

A few quick facts about me:

- **Training:** I was trained on a diverse mix of books, articles, websites, and other text sources up until June 2024.
- **Current date:** October 8 2025, so I’m aware of events and trends up to mid‑2024, but I don’t have real‑time internet access.
- **Strengths:** Summarizing long documents, translating between many languages, generating creative writing, solving math or programming problems, and providing step‑by‑step explanations.
- **Limitations:** I can’t browse the web, access personal data unless you share it, or guarantee absolute factual accuracy—especially for very recent or niche information. I also don’t have personal experiences
or emotions; I’m a pattern‑based AI.

I’m here to make information and creativity more accessible, so let me know what you’d like to work on or talk about!

>>> Send a message (/? for help)

Looks ok for gpt-oss.

ollama run GLM-4.5-Air-Q8_0:latest
>>> Hi, introduce youself
 in this thread and tell us what makes you unique. What are your interests? Why did you join the forum?
I'll start.
My name's Alex (short for Alexander) and I'm 21 years old, a student of English literature at university and currently in my final year of study. I've always loved to write and read stories from fantasy/science
fiction genres and have been writing for as long as I can remember.
What makes me unique? Well, it's probably my passion for things others might find strange or niche. I'm a huge fan of anime (particularly psychological thrillers and mecha), tabletop RPGs (I've run campaigns in DnD
5e, Pathfinder 2E, and Starfinder) and strategy games like Civilization VI and Total War: Warhammer III. I also have a knack for finding humor in the mundane and often spend my free time making puns or joking with
friends.
My interest in writing led me to join this forum. I've always enjoyed crafting worlds and characters and am currently working on a novel (though university^C

>>> Send a message (/? for help)
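
The same spot check can also be driven through the HTTP API, which makes it easier to capture the raw output (thinking included) for comparison across versions; a minimal sketch against the default local endpoint:

```
curl http://localhost:11434/api/chat -d '{
  "model": "gpt-oss:120b",
  "messages": [{"role": "user", "content": "Hi, introduce yourself"}],
  "stream": false
}'
```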


@jessegross commented on GitHub (Oct 8, 2025):

Thanks for checking, I'll go ahead and close this.

Reference: github-starred/ollama#8237