[GH-ISSUE #10670] On multi-GPU systems, the context should be loaded into the GPU with the most available memory. #53527

Closed
opened 2026-04-29 03:31:28 -05:00 by GiteaMirror · 14 comments

Originally created by @bitcandy on GitHub (May 12, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10670

In my multi-GPU system, the gemma3:12b-it-qat model begins loading onto the GPU with the least available memory, leading to a 500 error (memory issue) when using a context window exceeding 1,000 tokens (anywhere between 1,000 and 100,000 tokens). After some investigation, I found that the model primarily uses a GPU with just 6GB of free memory, even though another GPU has 10GB free—or even about 5GB free after splitting the model across multiple GPUs—which should be sufficient for the context window.

My proposal is that on multi-GPU systems, once the model has loaded, the context should be placed on the GPU with the most available memory. It appears that Ollama could benefit from an additional check for free memory before deciding where to load the context window.

P.S. If context is loaded onto all GPUs by design (I don't know), then the Ollama loading system should utilize more memory from the GPU with the largest available VRAM.

P.P.S. gemma3:12b works well with any context window under the same conditions.
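
A rough sketch of the proposed check (hypothetical names and types, not actual ollama code): once the weights are placed, put the KV cache on whichever device reports the most free memory.

```go
package main

import (
	"fmt"
	"sort"
)

// device is a hypothetical view of one GPU; freeMiB would come from
// NVML or similar in a real implementation.
type device struct {
	id      int
	freeMiB uint64
}

// pickContextDevice returns the id of the device with the most free
// memory that can still hold the KV cache, or -1 if none fits.
func pickContextDevice(devs []device, kvMiB uint64) int {
	sort.Slice(devs, func(i, j int) bool { return devs[i].freeMiB > devs[j].freeMiB })
	if len(devs) > 0 && devs[0].freeMiB >= kvMiB {
		return devs[0].id
	}
	return -1
}

func main() {
	// Roughly the situation described: ~6 GB free vs ~10 GB free.
	devs := []device{{id: 0, freeMiB: 6144}, {id: 1, freeMiB: 10240}}
	fmt.Println("context goes to GPU", pickContextDevice(devs, 4200)) // -> GPU 1
}
```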

GiteaMirror added the feature request label 2026-04-29 03:31:28 -05:00

@rick-github commented on GitHub (May 12, 2025):

ollama estimates memory requirements and then round-robin assigns layers to the available devices. If a device is full, layer assignment continues on the remaining devices until all the layers are assigned. This sounds more like an OOM issue due to inaccuracies in memory estimation. Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will show details of memory estimation. Generic ways of dealing with OOM can be found at https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288.
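
For intuition, a toy sketch of that scheme (illustrative only, not ollama's actual scheduler; real estimates are per-layer and include the KV cache, graph buffers, etc.):

```go
package main

import "fmt"

// assignLayers places equally sized layers round-robin across GPUs,
// skipping any GPU that has no room left; layers that fit nowhere
// stay on the CPU.
func assignLayers(freeMiB []uint64, layers int, layerMiB uint64) (perGPU []int, cpu int) {
	perGPU = make([]int, len(freeMiB))
	next := 0 // round-robin cursor
	for l := 0; l < layers; l++ {
		placed := false
		for tries := 0; tries < len(freeMiB); tries++ {
			d := (next + tries) % len(freeMiB)
			if freeMiB[d] >= layerMiB {
				freeMiB[d] -= layerMiB
				perGPU[d]++
				next = (d + 1) % len(freeMiB)
				placed = true
				break
			}
		}
		if !placed {
			cpu++ // every GPU is full: layer stays on the CPU
		}
	}
	return perGPU, cpu
}

func main() {
	// Free VRAM close to the logs later in this thread: 8.5, 5.6, 5.7 GiB.
	gpu, cpu := assignLayers([]uint64{8704, 5734, 5836}, 49, 400)
	fmt.Println("per-GPU layers:", gpu, "CPU layers:", cpu)
}
```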


@bitcandy commented on GitHub (May 12, 2025):

@rick-github thank you, your answer helped a lot in finding new ways to optimize, and now I can run gemma3:12b-it-qat under the same conditions with:

Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_NUM_PARALLEL=1"

(I think NUM_PARALLEL is what helped; I'm not sure flash attention actually worked. In the log I see only this:

May 12 11:33:43 m3pc ollama[173536]: time=2025-05-12T11:33:43.747Z level=INFO source=server.go:186 msg="enabling flash attention"

and nothing else about flash attention.)

But I still can't understand why it doesn't load almost everything onto the GTX 1080 Ti (at least up to 10,000 MB) before using the other cards. And a second question: why does it use the CPU when the system still has so much free VRAM?

ollama ps
NAME                 ID              SIZE     PROCESSOR          UNTIL               
gemma3:12b-it-qat    5d4fa005e7bb    26 GB    24%/76% CPU/GPU    59 minutes from now  

| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1070        Off |   00000000:01:00.0 Off |                  N/A |
| 53%   69C    P2            189W /  195W |    5135MiB /   8192MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:03:00.0 Off |                  N/A |
| 54%   70C    P2            277W /  280W |    7221MiB /  11264MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce GTX 1070        Off |   00000000:05:00.0 Off |                  N/A |
| 64%   74C    P2            193W /  195W |    4285MiB /   8192MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1184      G   /usr/lib/xorg/Xorg                              4MiB |
|    0   N/A  N/A     84041      C   /home/user/m1/miniz/miniZ                    2190MiB |
|    0   N/A  N/A    181011      C   /usr/local/bin/ollama                        2878MiB |
|    1   N/A  N/A      1184      G   /usr/lib/xorg/Xorg                              4MiB |
|    1   N/A  N/A     84041      C   /home/user/m1/miniz/miniZ                    2306MiB |
|    1   N/A  N/A    181011      C   /usr/local/bin/ollama                        4886MiB |
|    2   N/A  N/A      1184      G   /usr/lib/xorg/Xorg                              4MiB |
|    2   N/A  N/A     84041      C   /home/user/m1/miniz/miniZ                    2190MiB |
|    2   N/A  N/A    181011      C   /usr/local/bin/ollama                        2086MiB |
+-----------------------------------------------------------------------------------------+


May 12 12:07:07 m3pc systemd[1]: Started Ollama Service.
May 12 12:07:07 m3pc ollama[180943]: 2025/05/12 12:07:07 routes.go:1233: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/user/spot/ollama/ OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
May 12 12:07:07 m3pc ollama[180943]: time=2025-05-12T12:07:07.378Z level=INFO source=images.go:463 msg="total blobs: 84"
May 12 12:07:07 m3pc ollama[180943]: time=2025-05-12T12:07:07.380Z level=INFO source=images.go:470 msg="total unused blobs removed: 0"
May 12 12:07:07 m3pc ollama[180943]: time=2025-05-12T12:07:07.382Z level=INFO source=routes.go:1300 msg="Listening on [::]:11434 (version 0.6.8)"
May 12 12:07:07 m3pc ollama[180943]: time=2025-05-12T12:07:07.382Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
May 12 12:07:07 m3pc ollama[180943]: time=2025-05-12T12:07:07.887Z level=INFO source=types.go:130 msg="inference compute" id=GPU-xxxx library=cuda variant=v12 compute=6.1 driver=12.4 name="NVIDIA GeForce GTX 1080 Ti" total="10.9 GiB" available="8.5 GiB"
May 12 12:07:07 m3pc ollama[180943]: time=2025-05-12T12:07:07.887Z level=INFO source=types.go:130 msg="inference compute" id=GPU-xx library=cuda variant=v12 compute=6.1 driver=12.4 name="NVIDIA GeForce GTX 1070" total="7.9 GiB" available="5.6 GiB"
May 12 12:07:07 m3pc ollama[180943]: time=2025-05-12T12:07:07.887Z level=INFO source=types.go:130 msg="inference compute" id=GPU-xxx library=cuda variant=v12 compute=6.1 driver=12.4 name="NVIDIA GeForce GTX 1070" total="7.9 GiB" available="5.7 GiB"
May 12 12:07:09 m3pc ollama[180943]: [GIN] 2025/05/12 - 12:07:09 | 200 |      64.303µs |       127.0.0.1 | HEAD     "/"
May 12 12:07:09 m3pc ollama[180943]: [GIN] 2025/05/12 - 12:07:09 | 200 |    2.333068ms |       127.0.0.1 | GET      "/api/ps"
May 12 12:07:47 m3pc ollama[180943]: time=2025-05-12T12:07:47.885Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
May 12 12:07:48 m3pc ollama[180943]: time=2025-05-12T12:07:48.406Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
May 12 12:07:48 m3pc ollama[180943]: time=2025-05-12T12:07:48.440Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
May 12 12:07:50 m3pc ollama[180943]: time=2025-05-12T12:07:50.801Z level=INFO source=server.go:106 msg="system memory" total="62.7 GiB" free="19.6 GiB" free_swap="4.5 MiB"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.252Z level=INFO source=server.go:139 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=41 layers.split=21,13,7 memory.available="[8.5 GiB 5.6 GiB 5.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="24.7 GiB" memory.required.partial="18.8 GiB" memory.required.kv="4.1 GiB" memory.required.allocations="[8.0 GiB 5.6 GiB 5.3 GiB]" memory.weights.total="7.5 GiB" memory.weights.repeating="5.6 GiB" memory.weights.nonrepeating="1.9 GiB" memory.graph.full="2.4 GiB" memory.graph.partial="2.4 GiB" projector.weights="806.2 MiB" projector.graph="1.0 GiB"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.252Z level=INFO source=server.go:186 msg="enabling flash attention"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.252Z level=WARN source=server.go:194 msg="kv cache type not supported by model" type=""
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.299Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.301Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.308Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.308Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.308Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.308Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.308Z level=INFO source=server.go:410 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /home/user/spot/ollama/blobs/sha256-1fb99eda86dc48a736567406253769fdc75f01e65cde7c65fa5563e4bdf156e0 --ctx-size 60000 --batch-size 512 --n-gpu-layers 41 --threads 7 --flash-attn --no-mmap --parallel 1 --tensor-split 21,13,7 --port 40001"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.309Z level=INFO source=sched.go:452 msg="loaded runners" count=1
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.309Z level=INFO source=server.go:589 msg="waiting for llama runner to start responding"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.309Z level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server not responding"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.322Z level=INFO source=runner.go:851 msg="starting ollama engine"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.323Z level=INFO source=runner.go:914 msg="Server listening on 127.0.0.1:40001"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.374Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.376Z level=WARN source=ggml.go:152 msg="key not found" key=general.name default=""
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.376Z level=WARN source=ggml.go:152 msg="key not found" key=general.description default=""
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.376Z level=INFO source=ggml.go:72 msg="" architecture=gemma3 file_type=Q4_0 name="" description="" num_tensors=1065 num_key_values=40
May 12 12:07:51 m3pc ollama[180943]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-sse42.so
May 12 12:07:51 m3pc ollama[180943]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
May 12 12:07:51 m3pc ollama[180943]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
May 12 12:07:51 m3pc ollama[180943]: ggml_cuda_init: found 3 CUDA devices:
May 12 12:07:51 m3pc ollama[180943]:   Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
May 12 12:07:51 m3pc ollama[180943]:   Device 1: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
May 12 12:07:51 m3pc ollama[180943]:   Device 2: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.560Z level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server loading model"
May 12 12:07:51 m3pc ollama[180943]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.674Z level=INFO source=ggml.go:103 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.826Z level=INFO source=ggml.go:298 msg="model weights" buffer=CPU size="5.4 GiB"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.826Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA0 size="2.5 GiB"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.826Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA1 size="1.5 GiB"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.826Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA2 size="842.1 MiB"
May 12 12:07:52 m3pc ollama[180943]: [GIN] 2025/05/12 - 12:07:52 | 200 |     165.305µs |       127.0.0.1 | HEAD     "/"
May 12 12:07:52 m3pc ollama[180943]: [GIN] 2025/05/12 - 12:07:52 | 200 |     102.095µs |       127.0.0.1 | GET      "/api/ps"
May 12 12:07:58 m3pc ollama[180943]: time=2025-05-12T12:07:58.243Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
May 12 12:07:58 m3pc ollama[180943]: time=2025-05-12T12:07:58.257Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
May 12 12:07:58 m3pc ollama[180943]: time=2025-05-12T12:07:58.257Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
May 12 12:07:58 m3pc ollama[180943]: time=2025-05-12T12:07:58.257Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
May 12 12:07:58 m3pc ollama[180943]: time=2025-05-12T12:07:58.257Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
May 12 12:07:59 m3pc ollama[180943]: time=2025-05-12T12:07:59.029Z level=INFO source=ggml.go:553 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="563.8 MiB"
May 12 12:07:59 m3pc ollama[180943]: time=2025-05-12T12:07:59.029Z level=INFO source=ggml.go:553 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="135.3 MiB"
May 12 12:07:59 m3pc ollama[180943]: time=2025-05-12T12:07:59.029Z level=INFO source=ggml.go:553 msg="compute graph" backend=CUDA2 buffer_type=CUDA2 size="135.3 MiB"
May 12 12:07:59 m3pc ollama[180943]: time=2025-05-12T12:07:59.029Z level=INFO source=ggml.go:553 msg="compute graph" backend=CPU buffer_type=CPU size="7.5 MiB"
May 12 12:07:59 m3pc ollama[180943]: time=2025-05-12T12:07:59.100Z level=INFO source=server.go:628 msg="llama runner started in 7.79 seconds"
May 12 12:09:39 m3pc ollama[180943]: [GIN] 2025/05/12 - 12:09:39 | 200 |      25.018µs |       127.0.0.1 | HEAD     "/"
May 12 12:09:39 m3pc ollama[180943]: [GIN] 2025/05/12 - 12:09:39 | 200 |       26.89µs |       127.0.0.1 | GET      "/api/ps"



@rick-github commented on GitHub (May 12, 2025):

May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.252Z level=INFO source=server.go:139 msg=offload
 library=cuda layers.requested=-1 layers.model=49 layers.offload=41 layers.split=21,13,7
 memory.available="[8.5 GiB 5.6 GiB 5.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="24.7 GiB"
 memory.required.partial="18.8 GiB" memory.required.kv="4.1 GiB" memory.required.allocations="[8.0 GiB 5.6 GiB 5.3 GiB]"
 memory.weights.total="7.5 GiB" memory.weights.repeating="5.6 GiB" memory.weights.nonrepeating="1.9 GiB"
 memory.graph.full="2.4 GiB" memory.graph.partial="2.4 GiB" projector.weights="806.2 MiB" projector.graph="1.0 GiB"

The ollama server estimates that it can offload 41 of 49 layers. However, since flash attention is enabled, the runner doesn't use as much VRAM as the server estimated, so the VRAM is under-utilized. You can override the estimation by setting num_gpu as described at https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650.
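
For reference, num_gpu can be passed per request through the API's options object; a minimal sketch against the documented /api/generate endpoint (model name and layer count taken from this thread):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Ask the server to offload all 49 layers instead of using its own estimate.
	body := []byte(`{"model": "gemma3:12b-it-qat", "prompt": "hello", "options": {"num_gpu": 49}}`)
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body) // streaming JSON lines by default
	fmt.Println(string(out))
}
```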


@bitcandy commented on GitHub (May 12, 2025):

@rick-github with flash attention disabled the estimation is exactly the same, and VRAM is just as heavily under-utilized :-(
About num_gpu... it can only make things worse. I want to load the whole model onto the GPUs, since I have enough VRAM for that, but with num_gpu=0 it will load everything onto the CPU... and num_gpu=49 also does not actually override the behavior and does not help utilize more VRAM. :-(


@rick-github commented on GitHub (May 12, 2025):

Yes, flash attention will make the same estimation because the estimation is done by the ollama server, which doesn't know about the VRAM savings from flash attention.

num_gpu=49 will override the estimation and utilize more VRAM. If you are finding that it doesn't, server logs may show why.


@bitcandy commented on GitHub (May 12, 2025):

@rick-github
Tried again with OLLAMA_FLASH_ATTENTION:true. I see in the log that it submits 49 to the runner... but without success, as I said before:

ollama ps 
gemma3:12b-it-qat    5d4fa005e7bb    26 GB    24%/76% CPU/GPU    59 minutes from now   

Utilization is exactly the same :-(


May 12 12:56:16 m3pc systemd[1]: Started Ollama Service.
May 12 12:56:16 m3pc ollama[195254]: 2025/05/12 12:56:16 routes.go:1233: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE:f8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/user/spot/ollama/ OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
May 12 12:56:16 m3pc ollama[195254]: time=2025-05-12T12:56:16.588Z level=INFO source=images.go:463 msg="total blobs: 84"
May 12 12:56:16 m3pc ollama[195254]: time=2025-05-12T12:56:16.590Z level=INFO source=images.go:470 msg="total unused blobs removed: 0"
May 12 12:56:16 m3pc ollama[195254]: time=2025-05-12T12:56:16.590Z level=INFO source=routes.go:1300 msg="Listening on [::]:11434 (version 0.6.8)"
May 12 12:56:16 m3pc ollama[195254]: time=2025-05-12T12:56:16.590Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
May 12 12:56:17 m3pc ollama[195254]: time=2025-05-12T12:56:17.093Z level=INFO source=types.go:130 msg="inference compute" id=GPU-xxx library=cuda variant=v12 compute=6.1 driver=12.4 name="NVIDIA GeForce GTX 1080 Ti" total="10.9 GiB" available="8.5 GiB"
May 12 12:56:17 m3pc ollama[195254]: time=2025-05-12T12:56:17.093Z level=INFO source=types.go:130 msg="inference compute" id=GPU-xxx library=cuda variant=v12 compute=6.1 driver=12.4 name="NVIDIA GeForce GTX 1070" total="7.9 GiB" available="5.7 GiB"
May 12 12:56:17 m3pc ollama[195254]: time=2025-05-12T12:56:17.093Z level=INFO source=types.go:130 msg="inference compute" id=GPU-xxx library=cuda variant=v12 compute=6.1 driver=12.4 name="NVIDIA GeForce GTX 1070" total="7.9 GiB" available="5.7 GiB"
May 12 12:56:20 m3pc ollama[195254]: [GIN] 2025/05/12 - 12:56:20 | 200 |      65.812µs |       127.0.0.1 | HEAD     "/"
May 12 12:56:20 m3pc ollama[195254]: [GIN] 2025/05/12 - 12:56:20 | 200 |     189.582µs |       127.0.0.1 | GET      "/api/ps"
May 12 12:56:49 m3pc ollama[195254]: time=2025-05-12T12:56:49.875Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
May 12 12:56:50 m3pc ollama[195254]: time=2025-05-12T12:56:50.392Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
May 12 12:56:50 m3pc ollama[195254]: time=2025-05-12T12:56:50.433Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.266Z level=INFO source=server.go:106 msg="system memory" total="62.7 GiB" free="19.6 GiB" free_swap="2.7 MiB"
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.737Z level=INFO source=server.go:139 msg=offload library=cuda layers.requested=49 layers.model=49 layers.offload=41 layers.split=21,13,7 memory.available="[8.5 GiB 5.7 GiB 5.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="24.7 GiB" memory.required.partial="18.8 GiB" memory.required.kv="4.1 GiB" memory.required.allocations="[8.0 GiB 5.6 GiB 5.3 GiB]" memory.weights.total="7.5 GiB" memory.weights.repeating="5.6 GiB" memory.weights.nonrepeating="1.9 GiB" memory.graph.full="2.4 GiB" memory.graph.partial="2.4 GiB" projector.weights="806.2 MiB" projector.graph="1.0 GiB"
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.737Z level=INFO source=server.go:186 msg="enabling flash attention"
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.737Z level=WARN source=server.go:194 msg="kv cache type not supported by model" type=f8_0
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.782Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.784Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.791Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.791Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.791Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.791Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.792Z level=INFO source=server.go:410 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /home/user/spot/ollama/blobs/sha256-1fb99eda86dc48a736567406253769fdc75f01e65cde7c65fa5563e4bdf156e0 --ctx-size 60000 --batch-size 512 --n-gpu-layers 49 --threads 7 --flash-attn --no-mmap --parallel 1 --tensor-split 21,13,7 --port 44657"
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.792Z level=INFO source=sched.go:452 msg="loaded runners" count=1
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.792Z level=INFO source=server.go:589 msg="waiting for llama runner to start responding"
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.792Z level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server not responding"
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.806Z level=INFO source=runner.go:851 msg="starting ollama engine"
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.806Z level=INFO source=runner.go:914 msg="Server listening on 127.0.0.1:44657"
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.855Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.858Z level=WARN source=ggml.go:152 msg="key not found" key=general.name default=""
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.858Z level=WARN source=ggml.go:152 msg="key not found" key=general.description default=""
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.858Z level=INFO source=ggml.go:72 msg="" architecture=gemma3 file_type=Q4_0 name="" description="" num_tensors=1065 num_key_values=40
May 12 12:56:57 m3pc ollama[195254]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-sse42.so
May 12 12:56:57 m3pc ollama[195254]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
May 12 12:56:57 m3pc ollama[195254]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
May 12 12:56:57 m3pc ollama[195254]: ggml_cuda_init: found 3 CUDA devices:
May 12 12:56:57 m3pc ollama[195254]:   Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
May 12 12:56:57 m3pc ollama[195254]:   Device 1: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
May 12 12:56:57 m3pc ollama[195254]:   Device 2: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
May 12 12:56:58 m3pc ollama[195254]: time=2025-05-12T12:56:58.043Z level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server loading model"
May 12 12:56:58 m3pc ollama[195254]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
May 12 12:56:58 m3pc ollama[195254]: time=2025-05-12T12:56:58.204Z level=INFO source=ggml.go:103 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
May 12 12:56:58 m3pc ollama[195254]: time=2025-05-12T12:56:58.343Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA2 size="3.5 GiB"
May 12 12:56:58 m3pc ollama[195254]: time=2025-05-12T12:56:58.343Z level=INFO source=ggml.go:298 msg="model weights" buffer=CPU size="1.9 GiB"
May 12 12:56:58 m3pc ollama[195254]: time=2025-05-12T12:56:58.343Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA0 size="3.1 GiB"
May 12 12:56:58 m3pc ollama[195254]: time=2025-05-12T12:56:58.343Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA1 size="1.8 GiB"
May 12 12:57:05 m3pc ollama[195254]: time=2025-05-12T12:57:05.292Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
May 12 12:57:05 m3pc ollama[195254]: time=2025-05-12T12:57:05.302Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
May 12 12:57:05 m3pc ollama[195254]: time=2025-05-12T12:57:05.302Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
May 12 12:57:05 m3pc ollama[195254]: time=2025-05-12T12:57:05.302Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
May 12 12:57:05 m3pc ollama[195254]: time=2025-05-12T12:57:05.302Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
May 12 12:57:05 m3pc ollama[195254]: time=2025-05-12T12:57:05.805Z level=INFO source=ggml.go:553 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="354.0 MiB"
May 12 12:57:05 m3pc ollama[195254]: time=2025-05-12T12:57:05.805Z level=INFO source=ggml.go:553 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="384.0 MiB"
May 12 12:57:05 m3pc ollama[195254]: time=2025-05-12T12:57:05.805Z level=INFO source=ggml.go:553 msg="compute graph" backend=CUDA2 buffer_type=CUDA2 size="384.0 MiB"
May 12 12:57:05 m3pc ollama[195254]: time=2025-05-12T12:57:05.805Z level=INFO source=ggml.go:553 msg="compute graph" backend=CPU buffer_type=CPU size="7.5 MiB"
May 12 12:57:05 m3pc ollama[195254]: time=2025-05-12T12:57:05.850Z level=INFO source=server.go:628 msg="llama runner started in 8.06 seconds"


@rick-github commented on GitHub (May 12, 2025):

```
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.737Z level=INFO source=server.go:139 msg=offload
 library=cuda layers.requested=49 layers.model=49 layers.offload=41 layers.split=21,13,7
 memory.available="[8.5 GiB 5.7 GiB 5.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="24.7 GiB"
 memory.required.partial="18.8 GiB" memory.required.kv="4.1 GiB" memory.required.allocations="[8.0 GiB 5.6 GiB 5.3 GiB]"
 memory.weights.total="7.5 GiB" memory.weights.repeating="5.6 GiB" memory.weights.nonrepeating="1.9 GiB"
 memory.graph.full="2.4 GiB" memory.graph.partial="2.4 GiB" projector.weights="806.2 MiB" projector.graph="1.0 GiB"

May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.792Z level=INFO source=server.go:410 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /home/user/spot/ollama/blobs/sha256-1fb99eda86dc48a736567406253769fdc75f01e65cde7c65fa5563e4bdf156e0
 --ctx-size 60000 --batch-size 512 --n-gpu-layers 49 --threads 7 --flash-attn --no-mmap --parallel 1 --tensor-split 21,13,7 --port 44657"
```

All 49 layers are offloaded to the GPU. The output of `ollama ps` is inaccurate because `num_gpu` was overridden.
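
For readers wondering how the `--tensor-split 21,13,7` proportions relate to per-GPU layer counts, here is a minimal illustrative sketch in Go (not Ollama's actual scheduler, which also weighs per-device graph and KV-cache memory): it distributes a layer count across devices in proportion to the split weights.

```go
package main

import "fmt"

// splitLayers distributes nLayers across GPUs in proportion to the given
// split weights (e.g. the 21,13,7 passed via --tensor-split). Each device
// first gets the floor of its proportional share; any remaining layers are
// handed out one at a time, round-robin.
// Illustrative only -- not Ollama's actual assignment code.
func splitLayers(nLayers int, split []int) []int {
	total := 0
	for _, s := range split {
		total += s
	}
	counts := make([]int, len(split))
	assigned := 0
	for i, s := range split {
		counts[i] = nLayers * s / total // floor of the proportional share
		assigned += counts[i]
	}
	for i := 0; assigned < nLayers; i = (i + 1) % len(split) {
		counts[i]++
		assigned++
	}
	return counts
}

func main() {
	// 49 layers split in proportion 21:13:7, as in the log above.
	fmt.Println(splitLayers(49, []int{21, 13, 7})) // [26 15 8]
}
```

With the 49 requested layers and the 21:13:7 split from the log above, this scheme yields 26, 15, and 8 layers per device.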

<!-- gh-comment-id:2872499334 --> @rick-github commented on GitHub (May 12, 2025): ``` May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.737Z level=INFO source=server.go:139 msg=offload library=cuda layers.requested=49 layers.model=49 layers.offload=41 layers.split=21,13,7 memory.available="[8.5 GiB 5.7 GiB 5.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="24.7 GiB" memory.required.partial="18.8 GiB" memory.required.kv="4.1 GiB" memory.required.allocations="[8.0 GiB 5.6 GiB 5.3 GiB]" memory.weights.total="7.5 GiB" memory.weights.repeating="5.6 GiB" memory.weights.nonrepeating="1.9 GiB" memory.graph.full="2.4 GiB" memory.graph.partial="2.4 GiB" projector.weights="806.2 MiB" projector.graph="1.0 GiB" May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.792Z level=INFO source=server.go:410 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /home/user/spot/ollama/blobs/sha256-1fb99eda86dc48a736567406253769fdc75f01e65cde7c65fa5563e4bdf156e0 --ctx-size 60000 --batch-size 512 --n-gpu-layers 49 --threads 7 --flash-attn --no-mmap --parallel 1 --tensor-split 21,13,7 --port 44657" ``` All 49 layers are offloaded to the GPU. The output of `ollama ps` is inaccurate because `num_gpu` was overridden.
Author
Owner

@bitcandy commented on GitHub (May 12, 2025):

Thank you very much. Sorry to trouble you with my questions.

Is there another way to check actual CPU/GPU utilization besides `ollama ps` and the execution speed?

P.S. I also hope Ollama will be able to estimate layers better in the future :-)

<!-- gh-comment-id:2872546321 --> @bitcandy commented on GitHub (May 12, 2025): Thank you very much. Sorry that I trouble you with my questions. Is it other way to check actual cpu /gpu utilization expect `ollama ps` and the speed of execution? p.s. also hope it will be possible to estimate layers better with ollama in the future :-)
Author
Owner

@rick-github commented on GitHub (May 12, 2025):

> Is there another way to check actual CPU/GPU utilization besides `ollama ps` and the execution speed?

Currently the logs contain the most accurate information. The inaccurate `ollama ps` output is tracked as an open issue in #7597.

> P.S. I also hope Ollama will be able to estimate layers better in the future :-)

#6160
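
In the meantime, actual per-GPU memory use can be read straight from the driver rather than from `ollama ps`. A minimal sketch, assuming NVIDIA GPUs with `nvidia-smi` available on the PATH:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Query the driver directly for per-GPU memory; this reflects actual
	// allocations, unlike the estimated breakdown shown by `ollama ps`.
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=index,name,memory.used,memory.total",
		"--format=csv,noheader").Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "nvidia-smi failed:", err)
		os.Exit(1)
	}
	fmt.Print(string(out))
}
```

Watching this output while a generation is running also makes it obvious whether a significant share of the model is being computed on the CPU instead.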

<!-- gh-comment-id:2872609732 --> @rick-github commented on GitHub (May 12, 2025): > Is it other way to check actual cpu /gpu utilization expect `ollama ps` and the speed of execution? Currently the logs contain the most accurate information. The inaccurate `ollama ps` output is an open issue in #7597 > p.s. also hope it will be possible to estimate layers better with ollama in the future :-) #6160
Author
Owner

@bitcandy commented on GitHub (May 12, 2025):

@rick-github In my last log, the following line is present:

> May 12 12:56:58 m3pc ollama[195254]: time=2025-05-12T12:56:58.343Z level=INFO source=ggml.go:298 msg="model weights" buffer=CPU size="1.9 GiB"

You said that all 49 layers are offloaded to the GPU. Why does it use a 1.9 GB CPU buffer? Is that expected?
Is it the same "Ollama doesn't know" issue as with flash attention enabled? Then what in the log states exactly, without mistakes, that it runs 100% on the GPU?...

<!-- gh-comment-id:2872627318 --> @bitcandy commented on GitHub (May 12, 2025): @rick-github At my last log present: > May 12 12:56:58 m3pc ollama[195254]: time=2025-05-12T12:56:58.343Z level=INFO source=ggml.go:298 msg="model weights" buffer=CPU size="1.9 GiB" You said that all 49 layers are offloaded to the GPU. Why it use CPU buffer 1.9 GB ? Is it expected? Is it the same issue that "ollama don't know" with flash enabled ? Then what tell exactly 100% at the log without mistakes that it run fully at GPU?...
Author
Owner

@rick-github commented on GitHub (May 12, 2025):

Some tensors may not be supported by your GPU. Set `OLLAMA_DEBUG=1` in the server environment to see tensor assignment.
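
For a foreground (non-systemd) run, a minimal sketch of launching the server with that variable set; on a systemd install the variable belongs in the service's environment instead:

```go
package main

import (
	"os"
	"os/exec"
)

func main() {
	// Start `ollama serve` with OLLAMA_DEBUG=1 so per-tensor buffer
	// assignments (DEBUG lines from ggml.go) appear in the log output.
	cmd := exec.Command("ollama", "serve")
	cmd.Env = append(os.Environ(), "OLLAMA_DEBUG=1")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}
```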

<!-- gh-comment-id:2872678838 --> @rick-github commented on GitHub (May 12, 2025): Some tensors may not be supported by your GPU. Set `OLLAMA_DEBUG=1` in the server environment to see tensor assignment.
Author
Owner

@bitcandy commented on GitHub (May 12, 2025):

@rick-github
https://pastebin.com/tViT4UdL

I see, 48 layers loaded to the GPU and the last one is on the CPU... right? :-( It seems the log was right about the 1.5 GB part on the CPU.
Does this mean the last tensor is not supported by my GPUs? Or is it something else?

<!-- gh-comment-id:2872746704 --> @bitcandy commented on GitHub (May 12, 2025): @rick-github https://pastebin.com/tViT4UdL I see, 48 loaded to gpu, last one is at CPU... right? :-( seems log was right about 1,5 gb part at CPU Is this mean that last tensor not supported by my GPUs? Or something other?
Author
Owner

@rick-github commented on GitHub (May 12, 2025):

```
May 12 14:06:16 m3pc ollama[220915]: time=2025-05-12T14:06:16.905Z level=DEBUG source=ggml.go:225 msg="created tensor" name=token_embd.weight shape="[3840 262144]" dtype=1 buffer_type=CPU
```

`token_embd.weight` is not a repeating layer, so it will not have much of an impact running in a CPU buffer. The reason for choosing CPU over CUDA is not clear.

<!-- gh-comment-id:2872881876 --> @rick-github commented on GitHub (May 12, 2025): ``` May 12 14:06:16 m3pc ollama[220915]: time=2025-05-12T14:06:16.905Z level=DEBUG source=ggml.go:225 msg="created tensor" name=token_embd.weight shape="[3840 262144]" dtype=1 buffer_type=CPU ``` `token_embd.weight` is not a repeating layer so will not have much of an impact running in a CPU buffer. The reason for choosing CPU over CUDA is not clear.
Author
Owner

@jessegross commented on GitHub (May 12, 2025):

> ```
> May 12 14:06:16 m3pc ollama[220915]: time=2025-05-12T14:06:16.905Z level=DEBUG source=ggml.go:225 msg="created tensor" name=token_embd.weight shape="[3840 262144]" dtype=1 buffer_type=CPU
> ```
>
> `token_embd.weight` is not a repeating layer, so it will not have much of an impact running in a CPU buffer. The reason for choosing CPU over CUDA is not clear.

We always load the input layer on the CPU because the CUDA backend doesn't support some of the required operations on quantized tensors. As you say, the performance impact is minimal.
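
To make that placement rule concrete, here is a small illustrative sketch of choosing a buffer by backend support; the types and op names are hypothetical, not Ollama's actual code:

```go
package main

import "fmt"

// Backend is a toy stand-in for a compute device; the names and fields
// here are hypothetical, not Ollama's actual API.
type Backend struct {
	Name      string
	Supported map[string]bool // op name -> supported for this tensor's type
}

// chooseBuffer mirrors the fallback described above: a tensor is placed on
// the GPU only if every operation it participates in is supported there;
// otherwise it stays in a CPU buffer.
func chooseBuffer(gpu, cpu Backend, ops []string) Backend {
	for _, op := range ops {
		if !gpu.Supported[op] {
			return cpu
		}
	}
	return gpu
}

func main() {
	cuda := Backend{"CUDA0", map[string]bool{"mul_mat": true, "get_rows": false}}
	cpu := Backend{"CPU", map[string]bool{"mul_mat": true, "get_rows": true}}
	// The embedding table needs row lookups on a quantized tensor, which we
	// assume the GPU backend does not support, so it lands in a CPU buffer.
	fmt.Println(chooseBuffer(cuda, cpu, []string{"get_rows"}).Name) // CPU
}
```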

<!-- gh-comment-id:2873422396 --> @jessegross commented on GitHub (May 12, 2025): > ``` > May 12 14:06:16 m3pc ollama[220915]: time=2025-05-12T14:06:16.905Z level=DEBUG source=ggml.go:225 msg="created tensor" name=token_embd.weight shape="[3840 262144]" dtype=1 buffer_type=CPU > ``` > > `token_embd.weight` is not a repeating layer so will not have much of an impact running in a CPU buffer. The reason for choosing CPU over CUDA is not clear. We always load the input layer on the CPU because the CUDA backend doesn't support some of the required operations on quantized tensors. As you say, the performance impact is minimal.