[GH-ISSUE #10670] On multi-GPU systems, the context should be loaded into the GPU with the most available memory. #53527

Closed
opened 2026-04-29 03:31:28 -05:00 by GiteaMirror · 14 comments

Originally created by @bitcandy on GitHub (May 12, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10670

In my multi-GPU system, the gemma3:12b-it-qat model begins loading onto the GPU with the least available memory, leading to a 500 error (memory issue) when using a context window exceeding 1,000 tokens (anywhere between 1,000 and 100,000 tokens). After some investigation, I found that the model primarily uses a GPU with just 6GB of free memory, even though another GPU has 10GB free—or even about 5GB free after splitting the model across multiple GPUs—which should be sufficient for the context window.

My proposal is that on multi-GPU systems, once the model has loaded, the context should be placed on the GPU with the most available memory. It appears that Ollama could benefit from an additional check for free memory before deciding where to load the context window.

P.S. If context is loaded onto all GPUs by design (I don't know), then the Ollama loading system should utilize more memory from the GPU with the largest available VRAM.

P.P.S. gemma3:12b works well with any context window under the same conditions.
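
A rough sketch of the proposed check (hypothetical names and types, not actual ollama code): once the weights are placed, put the KV cache on whichever device reports the most free memory.

```go
package main

import (
	"fmt"
	"sort"
)

// device is a hypothetical view of one GPU; freeMiB would come from
// NVML or similar in a real implementation.
type device struct {
	id      int
	freeMiB uint64
}

// pickContextDevice returns the id of the device with the most free
// memory that can still hold the KV cache, or -1 if none fits.
func pickContextDevice(devs []device, kvMiB uint64) int {
	sort.Slice(devs, func(i, j int) bool { return devs[i].freeMiB > devs[j].freeMiB })
	if len(devs) > 0 && devs[0].freeMiB >= kvMiB {
		return devs[0].id
	}
	return -1
}

func main() {
	// Roughly the situation described: ~6 GB free vs ~10 GB free.
	devs := []device{{id: 0, freeMiB: 6144}, {id: 1, freeMiB: 10240}}
	fmt.Println("context goes to GPU", pickContextDevice(devs, 4200)) // -> GPU 1
}
```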

GiteaMirror added the feature request label 2026-04-29 03:31:28 -05:00

@rick-github commented on GitHub (May 12, 2025):

ollama estimates memory requirements and then round-robin assigns layers to the available devices. If a device is full, layer assignment continues on the remaining devices until all the layers are assigned. This sounds more like an OOM issue due to inaccuracies in memory estimation. Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will show details of memory estimation. Generic ways of dealing with OOM can be found at https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288.
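
For intuition, a toy sketch of that scheme (illustrative only, not ollama's actual scheduler; real estimates are per-layer and include the KV cache, graph buffers, etc.):

```go
package main

import "fmt"

// assignLayers places equally sized layers round-robin across GPUs,
// skipping any GPU that has no room left; layers that fit nowhere
// stay on the CPU.
func assignLayers(freeMiB []uint64, layers int, layerMiB uint64) (perGPU []int, cpu int) {
	perGPU = make([]int, len(freeMiB))
	next := 0 // round-robin cursor
	for l := 0; l < layers; l++ {
		placed := false
		for tries := 0; tries < len(freeMiB); tries++ {
			d := (next + tries) % len(freeMiB)
			if freeMiB[d] >= layerMiB {
				freeMiB[d] -= layerMiB
				perGPU[d]++
				next = (d + 1) % len(freeMiB)
				placed = true
				break
			}
		}
		if !placed {
			cpu++ // every GPU is full: layer stays on the CPU
		}
	}
	return perGPU, cpu
}

func main() {
	// Free VRAM close to the logs later in this thread: 8.5, 5.6, 5.7 GiB.
	gpu, cpu := assignLayers([]uint64{8704, 5734, 5836}, 49, 400)
	fmt.Println("per-GPU layers:", gpu, "CPU layers:", cpu)
}
```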


@bitcandy commented on GitHub (May 12, 2025):

@rick-github thank you, your answer helped a lot in finding new ways to optimize, and now I can run gemma3:12b-it-qat under the same conditions with:

Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_NUM_PARALLEL=1"

(I think NUM_PARALLEL is what helped; I'm not sure flash attention actually worked. In the log I see only this:

May 12 11:33:43 m3pc ollama[173536]: time=2025-05-12T11:33:43.747Z level=INFO source=server.go:186 msg="enabling flash attention"

and nothing else about flash attention.)

But I still can't understand why it doesn't load almost everything onto the GTX 1080 Ti (at least up to 10,000 MB) before using the other cards. And a second question: why does it use the CPU when the system still has so much free VRAM?

ollama ps
NAME                 ID              SIZE     PROCESSOR          UNTIL               
gemma3:12b-it-qat    5d4fa005e7bb    26 GB    24%/76% CPU/GPU    59 minutes from now  

| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1070        Off |   00000000:01:00.0 Off |                  N/A |
| 53%   69C    P2            189W /  195W |    5135MiB /   8192MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:03:00.0 Off |                  N/A |
| 54%   70C    P2            277W /  280W |    7221MiB /  11264MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce GTX 1070        Off |   00000000:05:00.0 Off |                  N/A |
| 64%   74C    P2            193W /  195W |    4285MiB /   8192MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1184      G   /usr/lib/xorg/Xorg                              4MiB |
|    0   N/A  N/A     84041      C   /home/user/m1/miniz/miniZ                    2190MiB |
|    0   N/A  N/A    181011      C   /usr/local/bin/ollama                        2878MiB |
|    1   N/A  N/A      1184      G   /usr/lib/xorg/Xorg                              4MiB |
|    1   N/A  N/A     84041      C   /home/user/m1/miniz/miniZ                    2306MiB |
|    1   N/A  N/A    181011      C   /usr/local/bin/ollama                        4886MiB |
|    2   N/A  N/A      1184      G   /usr/lib/xorg/Xorg                              4MiB |
|    2   N/A  N/A     84041      C   /home/user/m1/miniz/miniZ                    2190MiB |
|    2   N/A  N/A    181011      C   /usr/local/bin/ollama                        2086MiB |
+-----------------------------------------------------------------------------------------+


May 12 12:07:07 m3pc systemd[1]: Started Ollama Service.
May 12 12:07:07 m3pc ollama[180943]: 2025/05/12 12:07:07 routes.go:1233: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/user/spot/ollama/ OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
May 12 12:07:07 m3pc ollama[180943]: time=2025-05-12T12:07:07.378Z level=INFO source=images.go:463 msg="total blobs: 84"
May 12 12:07:07 m3pc ollama[180943]: time=2025-05-12T12:07:07.380Z level=INFO source=images.go:470 msg="total unused blobs removed: 0"
May 12 12:07:07 m3pc ollama[180943]: time=2025-05-12T12:07:07.382Z level=INFO source=routes.go:1300 msg="Listening on [::]:11434 (version 0.6.8)"
May 12 12:07:07 m3pc ollama[180943]: time=2025-05-12T12:07:07.382Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
May 12 12:07:07 m3pc ollama[180943]: time=2025-05-12T12:07:07.887Z level=INFO source=types.go:130 msg="inference compute" id=GPU-xxxx library=cuda variant=v12 compute=6.1 driver=12.4 name="NVIDIA GeForce GTX 1080 Ti" total="10.9 GiB" available="8.5 GiB"
May 12 12:07:07 m3pc ollama[180943]: time=2025-05-12T12:07:07.887Z level=INFO source=types.go:130 msg="inference compute" id=GPU-xx library=cuda variant=v12 compute=6.1 driver=12.4 name="NVIDIA GeForce GTX 1070" total="7.9 GiB" available="5.6 GiB"
May 12 12:07:07 m3pc ollama[180943]: time=2025-05-12T12:07:07.887Z level=INFO source=types.go:130 msg="inference compute" id=GPU-xxx library=cuda variant=v12 compute=6.1 driver=12.4 name="NVIDIA GeForce GTX 1070" total="7.9 GiB" available="5.7 GiB"
May 12 12:07:09 m3pc ollama[180943]: [GIN] 2025/05/12 - 12:07:09 | 200 |      64.303µs |       127.0.0.1 | HEAD     "/"
May 12 12:07:09 m3pc ollama[180943]: [GIN] 2025/05/12 - 12:07:09 | 200 |    2.333068ms |       127.0.0.1 | GET      "/api/ps"
May 12 12:07:47 m3pc ollama[180943]: time=2025-05-12T12:07:47.885Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
May 12 12:07:48 m3pc ollama[180943]: time=2025-05-12T12:07:48.406Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
May 12 12:07:48 m3pc ollama[180943]: time=2025-05-12T12:07:48.440Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
May 12 12:07:50 m3pc ollama[180943]: time=2025-05-12T12:07:50.801Z level=INFO source=server.go:106 msg="system memory" total="62.7 GiB" free="19.6 GiB" free_swap="4.5 MiB"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.252Z level=INFO source=server.go:139 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=41 layers.split=21,13,7 memory.available="[8.5 GiB 5.6 GiB 5.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="24.7 GiB" memory.required.partial="18.8 GiB" memory.required.kv="4.1 GiB" memory.required.allocations="[8.0 GiB 5.6 GiB 5.3 GiB]" memory.weights.total="7.5 GiB" memory.weights.repeating="5.6 GiB" memory.weights.nonrepeating="1.9 GiB" memory.graph.full="2.4 GiB" memory.graph.partial="2.4 GiB" projector.weights="806.2 MiB" projector.graph="1.0 GiB"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.252Z level=INFO source=server.go:186 msg="enabling flash attention"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.252Z level=WARN source=server.go:194 msg="kv cache type not supported by model" type=""
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.299Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.301Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.308Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.308Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.308Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.308Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.308Z level=INFO source=server.go:410 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /home/user/spot/ollama/blobs/sha256-1fb99eda86dc48a736567406253769fdc75f01e65cde7c65fa5563e4bdf156e0 --ctx-size 60000 --batch-size 512 --n-gpu-layers 41 --threads 7 --flash-attn --no-mmap --parallel 1 --tensor-split 21,13,7 --port 40001"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.309Z level=INFO source=sched.go:452 msg="loaded runners" count=1
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.309Z level=INFO source=server.go:589 msg="waiting for llama runner to start responding"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.309Z level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server not responding"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.322Z level=INFO source=runner.go:851 msg="starting ollama engine"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.323Z level=INFO source=runner.go:914 msg="Server listening on 127.0.0.1:40001"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.374Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.376Z level=WARN source=ggml.go:152 msg="key not found" key=general.name default=""
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.376Z level=WARN source=ggml.go:152 msg="key not found" key=general.description default=""
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.376Z level=INFO source=ggml.go:72 msg="" architecture=gemma3 file_type=Q4_0 name="" description="" num_tensors=1065 num_key_values=40
May 12 12:07:51 m3pc ollama[180943]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-sse42.so
May 12 12:07:51 m3pc ollama[180943]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
May 12 12:07:51 m3pc ollama[180943]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
May 12 12:07:51 m3pc ollama[180943]: ggml_cuda_init: found 3 CUDA devices:
May 12 12:07:51 m3pc ollama[180943]:   Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
May 12 12:07:51 m3pc ollama[180943]:   Device 1: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
May 12 12:07:51 m3pc ollama[180943]:   Device 2: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.560Z level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server loading model"
May 12 12:07:51 m3pc ollama[180943]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.674Z level=INFO source=ggml.go:103 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.826Z level=INFO source=ggml.go:298 msg="model weights" buffer=CPU size="5.4 GiB"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.826Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA0 size="2.5 GiB"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.826Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA1 size="1.5 GiB"
May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.826Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA2 size="842.1 MiB"
May 12 12:07:52 m3pc ollama[180943]: [GIN] 2025/05/12 - 12:07:52 | 200 |     165.305µs |       127.0.0.1 | HEAD     "/"
May 12 12:07:52 m3pc ollama[180943]: [GIN] 2025/05/12 - 12:07:52 | 200 |     102.095µs |       127.0.0.1 | GET      "/api/ps"
May 12 12:07:58 m3pc ollama[180943]: time=2025-05-12T12:07:58.243Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
May 12 12:07:58 m3pc ollama[180943]: time=2025-05-12T12:07:58.257Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
May 12 12:07:58 m3pc ollama[180943]: time=2025-05-12T12:07:58.257Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
May 12 12:07:58 m3pc ollama[180943]: time=2025-05-12T12:07:58.257Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
May 12 12:07:58 m3pc ollama[180943]: time=2025-05-12T12:07:58.257Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
May 12 12:07:59 m3pc ollama[180943]: time=2025-05-12T12:07:59.029Z level=INFO source=ggml.go:553 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="563.8 MiB"
May 12 12:07:59 m3pc ollama[180943]: time=2025-05-12T12:07:59.029Z level=INFO source=ggml.go:553 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="135.3 MiB"
May 12 12:07:59 m3pc ollama[180943]: time=2025-05-12T12:07:59.029Z level=INFO source=ggml.go:553 msg="compute graph" backend=CUDA2 buffer_type=CUDA2 size="135.3 MiB"
May 12 12:07:59 m3pc ollama[180943]: time=2025-05-12T12:07:59.029Z level=INFO source=ggml.go:553 msg="compute graph" backend=CPU buffer_type=CPU size="7.5 MiB"
May 12 12:07:59 m3pc ollama[180943]: time=2025-05-12T12:07:59.100Z level=INFO source=server.go:628 msg="llama runner started in 7.79 seconds"
May 12 12:09:39 m3pc ollama[180943]: [GIN] 2025/05/12 - 12:09:39 | 200 |      25.018µs |       127.0.0.1 | HEAD     "/"
May 12 12:09:39 m3pc ollama[180943]: [GIN] 2025/05/12 - 12:09:39 | 200 |       26.89µs |       127.0.0.1 | GET      "/api/ps"



@rick-github commented on GitHub (May 12, 2025):

May 12 12:07:51 m3pc ollama[180943]: time=2025-05-12T12:07:51.252Z level=INFO source=server.go:139 msg=offload
 library=cuda layers.requested=-1 layers.model=49 layers.offload=41 layers.split=21,13,7
 memory.available="[8.5 GiB 5.6 GiB 5.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="24.7 GiB"
 memory.required.partial="18.8 GiB" memory.required.kv="4.1 GiB" memory.required.allocations="[8.0 GiB 5.6 GiB 5.3 GiB]"
 memory.weights.total="7.5 GiB" memory.weights.repeating="5.6 GiB" memory.weights.nonrepeating="1.9 GiB"
 memory.graph.full="2.4 GiB" memory.graph.partial="2.4 GiB" projector.weights="806.2 MiB" projector.graph="1.0 GiB"

The ollama server estimates that it can offload 41 of 49 layers. However, since flash attention is enabled, the runner doesn't use as much VRAM as the server estimated, so the VRAM is under-utilized. You can override the estimation by setting num_gpu as described at https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650.
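
For reference, num_gpu can be passed per request through the API's options object; a minimal sketch against the documented /api/generate endpoint (model name and layer count taken from this thread):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Ask the server to offload all 49 layers instead of using its own estimate.
	body := []byte(`{"model": "gemma3:12b-it-qat", "prompt": "hello", "options": {"num_gpu": 49}}`)
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body) // streaming JSON lines by default
	fmt.Println(string(out))
}
```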


@bitcandy commented on GitHub (May 12, 2025):

@rick-github with flash attention disabled the estimation is exactly the same, and VRAM is just as heavily under-utilized :-(
About num_gpu... it can only make things worse. I want to load the whole model onto the GPUs, since I have enough VRAM for that, but with num_gpu=0 it will load everything onto the CPU... and num_gpu=49 also does not actually override the behavior and does not help utilize more VRAM. :-(


@rick-github commented on GitHub (May 12, 2025):

Yes, flash attention will make the same estimation because the estimation is done by the ollama server, which doesn't know about the VRAM savings from flash attention.

num_gpu=49 will override the estimation and utilize more VRAM. If you are finding that it doesn't, server logs may show why.


@bitcandy commented on GitHub (May 12, 2025):

@rick-github
Tried again with OLLAMA_FLASH_ATTENTION:true. I see in the log that it submits 49 to the runner... but without success, as I said before:

ollama ps 
gemma3:12b-it-qat    5d4fa005e7bb    26 GB    24%/76% CPU/GPU    59 minutes from now   

Utilization is exactly the same :-(


May 12 12:56:16 m3pc systemd[1]: Started Ollama Service.
May 12 12:56:16 m3pc ollama[195254]: 2025/05/12 12:56:16 routes.go:1233: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE:f8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/user/spot/ollama/ OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
May 12 12:56:16 m3pc ollama[195254]: time=2025-05-12T12:56:16.588Z level=INFO source=images.go:463 msg="total blobs: 84"
May 12 12:56:16 m3pc ollama[195254]: time=2025-05-12T12:56:16.590Z level=INFO source=images.go:470 msg="total unused blobs removed: 0"
May 12 12:56:16 m3pc ollama[195254]: time=2025-05-12T12:56:16.590Z level=INFO source=routes.go:1300 msg="Listening on [::]:11434 (version 0.6.8)"
May 12 12:56:16 m3pc ollama[195254]: time=2025-05-12T12:56:16.590Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
May 12 12:56:17 m3pc ollama[195254]: time=2025-05-12T12:56:17.093Z level=INFO source=types.go:130 msg="inference compute" id=GPU-xxx library=cuda variant=v12 compute=6.1 driver=12.4 name="NVIDIA GeForce GTX 1080 Ti" total="10.9 GiB" available="8.5 GiB"
May 12 12:56:17 m3pc ollama[195254]: time=2025-05-12T12:56:17.093Z level=INFO source=types.go:130 msg="inference compute" id=GPU-xxx library=cuda variant=v12 compute=6.1 driver=12.4 name="NVIDIA GeForce GTX 1070" total="7.9 GiB" available="5.7 GiB"
May 12 12:56:17 m3pc ollama[195254]: time=2025-05-12T12:56:17.093Z level=INFO source=types.go:130 msg="inference compute" id=GPU-xxx library=cuda variant=v12 compute=6.1 driver=12.4 name="NVIDIA GeForce GTX 1070" total="7.9 GiB" available="5.7 GiB"
May 12 12:56:20 m3pc ollama[195254]: [GIN] 2025/05/12 - 12:56:20 | 200 |      65.812µs |       127.0.0.1 | HEAD     "/"
May 12 12:56:20 m3pc ollama[195254]: [GIN] 2025/05/12 - 12:56:20 | 200 |     189.582µs |       127.0.0.1 | GET      "/api/ps"
May 12 12:56:49 m3pc ollama[195254]: time=2025-05-12T12:56:49.875Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
May 12 12:56:50 m3pc ollama[195254]: time=2025-05-12T12:56:50.392Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
May 12 12:56:50 m3pc ollama[195254]: time=2025-05-12T12:56:50.433Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.266Z level=INFO source=server.go:106 msg="system memory" total="62.7 GiB" free="19.6 GiB" free_swap="2.7 MiB"
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.737Z level=INFO source=server.go:139 msg=offload library=cuda layers.requested=49 layers.model=49 layers.offload=41 layers.split=21,13,7 memory.available="[8.5 GiB 5.7 GiB 5.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="24.7 GiB" memory.required.partial="18.8 GiB" memory.required.kv="4.1 GiB" memory.required.allocations="[8.0 GiB 5.6 GiB 5.3 GiB]" memory.weights.total="7.5 GiB" memory.weights.repeating="5.6 GiB" memory.weights.nonrepeating="1.9 GiB" memory.graph.full="2.4 GiB" memory.graph.partial="2.4 GiB" projector.weights="806.2 MiB" projector.graph="1.0 GiB"
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.737Z level=INFO source=server.go:186 msg="enabling flash attention"
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.737Z level=WARN source=server.go:194 msg="kv cache type not supported by model" type=f8_0
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.782Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.784Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.791Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.791Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.791Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.791Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.792Z level=INFO source=server.go:410 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /home/user/spot/ollama/blobs/sha256-1fb99eda86dc48a736567406253769fdc75f01e65cde7c65fa5563e4bdf156e0 --ctx-size 60000 --batch-size 512 --n-gpu-layers 49 --threads 7 --flash-attn --no-mmap --parallel 1 --tensor-split 21,13,7 --port 44657"
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.792Z level=INFO source=sched.go:452 msg="loaded runners" count=1
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.792Z level=INFO source=server.go:589 msg="waiting for llama runner to start responding"
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.792Z level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server not responding"
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.806Z level=INFO source=runner.go:851 msg="starting ollama engine"
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.806Z level=INFO source=runner.go:914 msg="Server listening on 127.0.0.1:44657"
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.855Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.858Z level=WARN source=ggml.go:152 msg="key not found" key=general.name default=""
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.858Z level=WARN source=ggml.go:152 msg="key not found" key=general.description default=""
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.858Z level=INFO source=ggml.go:72 msg="" architecture=gemma3 file_type=Q4_0 name="" description="" num_tensors=1065 num_key_values=40
May 12 12:56:57 m3pc ollama[195254]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-sse42.so
May 12 12:56:57 m3pc ollama[195254]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
May 12 12:56:57 m3pc ollama[195254]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
May 12 12:56:57 m3pc ollama[195254]: ggml_cuda_init: found 3 CUDA devices:
May 12 12:56:57 m3pc ollama[195254]:   Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes
May 12 12:56:57 m3pc ollama[195254]:   Device 1: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
May 12 12:56:57 m3pc ollama[195254]:   Device 2: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
May 12 12:56:58 m3pc ollama[195254]: time=2025-05-12T12:56:58.043Z level=INFO source=server.go:623 msg="waiting for server to become available" status="llm server loading model"
May 12 12:56:58 m3pc ollama[195254]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
May 12 12:56:58 m3pc ollama[195254]: time=2025-05-12T12:56:58.204Z level=INFO source=ggml.go:103 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
May 12 12:56:58 m3pc ollama[195254]: time=2025-05-12T12:56:58.343Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA2 size="3.5 GiB"
May 12 12:56:58 m3pc ollama[195254]: time=2025-05-12T12:56:58.343Z level=INFO source=ggml.go:298 msg="model weights" buffer=CPU size="1.9 GiB"
May 12 12:56:58 m3pc ollama[195254]: time=2025-05-12T12:56:58.343Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA0 size="3.1 GiB"
May 12 12:56:58 m3pc ollama[195254]: time=2025-05-12T12:56:58.343Z level=INFO source=ggml.go:298 msg="model weights" buffer=CUDA1 size="1.8 GiB"
May 12 12:57:05 m3pc ollama[195254]: time=2025-05-12T12:57:05.292Z level=WARN source=ggml.go:152 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
May 12 12:57:05 m3pc ollama[195254]: time=2025-05-12T12:57:05.302Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.local.freq_base default=10000
May 12 12:57:05 m3pc ollama[195254]: time=2025-05-12T12:57:05.302Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
May 12 12:57:05 m3pc ollama[195254]: time=2025-05-12T12:57:05.302Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.rope.freq_scale default=1
May 12 12:57:05 m3pc ollama[195254]: time=2025-05-12T12:57:05.302Z level=WARN source=ggml.go:152 msg="key not found" key=gemma3.mm_tokens_per_image default=256
May 12 12:57:05 m3pc ollama[195254]: time=2025-05-12T12:57:05.805Z level=INFO source=ggml.go:553 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="354.0 MiB"
May 12 12:57:05 m3pc ollama[195254]: time=2025-05-12T12:57:05.805Z level=INFO source=ggml.go:553 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="384.0 MiB"
May 12 12:57:05 m3pc ollama[195254]: time=2025-05-12T12:57:05.805Z level=INFO source=ggml.go:553 msg="compute graph" backend=CUDA2 buffer_type=CUDA2 size="384.0 MiB"
May 12 12:57:05 m3pc ollama[195254]: time=2025-05-12T12:57:05.805Z level=INFO source=ggml.go:553 msg="compute graph" backend=CPU buffer_type=CPU size="7.5 MiB"
May 12 12:57:05 m3pc ollama[195254]: time=2025-05-12T12:57:05.850Z level=INFO source=server.go:628 msg="llama runner started in 8.06 seconds"


@rick-github commented on GitHub (May 12, 2025):

```
May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.737Z level=INFO source=server.go:139 msg=offload
 library=cuda layers.requested=49 layers.model=49 layers.offload=41 layers.split=21,13,7
 memory.available="[8.5 GiB 5.7 GiB 5.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="24.7 GiB"
 memory.required.partial="18.8 GiB" memory.required.kv="4.1 GiB" memory.required.allocations="[8.0 GiB 5.6 GiB 5.3 GiB]"
 memory.weights.total="7.5 GiB" memory.weights.repeating="5.6 GiB" memory.weights.nonrepeating="1.9 GiB"
 memory.graph.full="2.4 GiB" memory.graph.partial="2.4 GiB" projector.weights="806.2 MiB" projector.graph="1.0 GiB"

May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.792Z level=INFO source=server.go:410 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /home/user/spot/ollama/blobs/sha256-1fb99eda86dc48a736567406253769fdc75f01e65cde7c65fa5563e4bdf156e0
 --ctx-size 60000 --batch-size 512 --n-gpu-layers 49 --threads 7 --flash-attn --no-mmap --parallel 1 --tensor-split 21,13,7 --port 44657"
```

All 49 layers are offloaded to the GPU. The output of `ollama ps` is inaccurate because `num_gpu` was overridden.
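
For readers wondering how the `--tensor-split 21,13,7` proportions relate to per-GPU layer counts, here is a minimal illustrative sketch in Go (not Ollama's actual scheduler, which also weighs per-device graph and KV-cache memory): it distributes a layer count across devices in proportion to the split weights.

```go
package main

import "fmt"

// splitLayers distributes nLayers across GPUs in proportion to the given
// split weights (e.g. the 21,13,7 passed via --tensor-split). Each device
// first gets the floor of its proportional share; any remaining layers are
// handed out one at a time, round-robin.
// Illustrative only -- not Ollama's actual assignment code.
func splitLayers(nLayers int, split []int) []int {
	total := 0
	for _, s := range split {
		total += s
	}
	counts := make([]int, len(split))
	assigned := 0
	for i, s := range split {
		counts[i] = nLayers * s / total // floor of the proportional share
		assigned += counts[i]
	}
	for i := 0; assigned < nLayers; i = (i + 1) % len(split) {
		counts[i]++
		assigned++
	}
	return counts
}

func main() {
	// 49 layers split in proportion 21:13:7, as in the log above.
	fmt.Println(splitLayers(49, []int{21, 13, 7})) // [26 15 8]
}
```

With the 49 requested layers and the 21:13:7 split from the log above, this scheme yields 26, 15, and 8 layers per device.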

<!-- gh-comment-id:2872499334 --> @rick-github commented on GitHub (May 12, 2025): ``` May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.737Z level=INFO source=server.go:139 msg=offload library=cuda layers.requested=49 layers.model=49 layers.offload=41 layers.split=21,13,7 memory.available="[8.5 GiB 5.7 GiB 5.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="24.7 GiB" memory.required.partial="18.8 GiB" memory.required.kv="4.1 GiB" memory.required.allocations="[8.0 GiB 5.6 GiB 5.3 GiB]" memory.weights.total="7.5 GiB" memory.weights.repeating="5.6 GiB" memory.weights.nonrepeating="1.9 GiB" memory.graph.full="2.4 GiB" memory.graph.partial="2.4 GiB" projector.weights="806.2 MiB" projector.graph="1.0 GiB" May 12 12:56:57 m3pc ollama[195254]: time=2025-05-12T12:56:57.792Z level=INFO source=server.go:410 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /home/user/spot/ollama/blobs/sha256-1fb99eda86dc48a736567406253769fdc75f01e65cde7c65fa5563e4bdf156e0 --ctx-size 60000 --batch-size 512 --n-gpu-layers 49 --threads 7 --flash-attn --no-mmap --parallel 1 --tensor-split 21,13,7 --port 44657" ``` All 49 layers are offloaded to the GPU. The output of `ollama ps` is inaccurate because `num_gpu` was overridden.
Author
Owner

@bitcandy commented on GitHub (May 12, 2025):

Thank you very much. Sorry to trouble you with my questions.

Is there another way to check actual CPU/GPU utilization besides `ollama ps` and the execution speed?

P.S. I also hope Ollama will be able to estimate layers better in the future :-)

<!-- gh-comment-id:2872546321 --> @bitcandy commented on GitHub (May 12, 2025): Thank you very much. Sorry that I trouble you with my questions. Is it other way to check actual cpu /gpu utilization expect `ollama ps` and the speed of execution? p.s. also hope it will be possible to estimate layers better with ollama in the future :-)
Author
Owner

@rick-github commented on GitHub (May 12, 2025):

> Is there another way to check actual CPU/GPU utilization besides `ollama ps` and the execution speed?

Currently the logs contain the most accurate information. The inaccurate `ollama ps` output is tracked as an open issue in #7597.

> P.S. I also hope Ollama will be able to estimate layers better in the future :-)

#6160
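
In the meantime, actual per-GPU memory use can be read straight from the driver rather than from `ollama ps`. A minimal sketch, assuming NVIDIA GPUs with `nvidia-smi` available on the PATH:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Query the driver directly for per-GPU memory; this reflects actual
	// allocations, unlike the estimated breakdown shown by `ollama ps`.
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=index,name,memory.used,memory.total",
		"--format=csv,noheader").Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "nvidia-smi failed:", err)
		os.Exit(1)
	}
	fmt.Print(string(out))
}
```

Watching this output while a generation is running also makes it obvious whether a significant share of the model is being computed on the CPU instead.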

<!-- gh-comment-id:2872609732 --> @rick-github commented on GitHub (May 12, 2025): > Is it other way to check actual cpu /gpu utilization expect `ollama ps` and the speed of execution? Currently the logs contain the most accurate information. The inaccurate `ollama ps` output is an open issue in #7597 > p.s. also hope it will be possible to estimate layers better with ollama in the future :-) #6160
Author
Owner

@bitcandy commented on GitHub (May 12, 2025):

@rick-github In my last log, the following line is present:

> May 12 12:56:58 m3pc ollama[195254]: time=2025-05-12T12:56:58.343Z level=INFO source=ggml.go:298 msg="model weights" buffer=CPU size="1.9 GiB"

You said that all 49 layers are offloaded to the GPU. Why does it use a 1.9 GB CPU buffer? Is that expected?
Is it the same "Ollama doesn't know" issue as with flash attention enabled? Then what in the log states exactly, without mistakes, that it runs 100% on the GPU?...

<!-- gh-comment-id:2872627318 --> @bitcandy commented on GitHub (May 12, 2025): @rick-github At my last log present: > May 12 12:56:58 m3pc ollama[195254]: time=2025-05-12T12:56:58.343Z level=INFO source=ggml.go:298 msg="model weights" buffer=CPU size="1.9 GiB" You said that all 49 layers are offloaded to the GPU. Why it use CPU buffer 1.9 GB ? Is it expected? Is it the same issue that "ollama don't know" with flash enabled ? Then what tell exactly 100% at the log without mistakes that it run fully at GPU?...
Author
Owner

@rick-github commented on GitHub (May 12, 2025):

Some tensors may not be supported by your GPU. Set `OLLAMA_DEBUG=1` in the server environment to see tensor assignment.
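
For a foreground (non-systemd) run, a minimal sketch of launching the server with that variable set; on a systemd install the variable belongs in the service's environment instead:

```go
package main

import (
	"os"
	"os/exec"
)

func main() {
	// Start `ollama serve` with OLLAMA_DEBUG=1 so per-tensor buffer
	// assignments (DEBUG lines from ggml.go) appear in the log output.
	cmd := exec.Command("ollama", "serve")
	cmd.Env = append(os.Environ(), "OLLAMA_DEBUG=1")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}
```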

<!-- gh-comment-id:2872678838 --> @rick-github commented on GitHub (May 12, 2025): Some tensors may not be supported by your GPU. Set `OLLAMA_DEBUG=1` in the server environment to see tensor assignment.
Author
Owner

@bitcandy commented on GitHub (May 12, 2025):

@rick-github
https://pastebin.com/tViT4UdL

I see, 48 layers loaded to the GPU and the last one is on the CPU... right? :-( It seems the log was right about the 1.5 GB part on the CPU.
Does this mean the last tensor is not supported by my GPUs? Or is it something else?

<!-- gh-comment-id:2872746704 --> @bitcandy commented on GitHub (May 12, 2025): @rick-github https://pastebin.com/tViT4UdL I see, 48 loaded to gpu, last one is at CPU... right? :-( seems log was right about 1,5 gb part at CPU Is this mean that last tensor not supported by my GPUs? Or something other?
Author
Owner

@rick-github commented on GitHub (May 12, 2025):

```
May 12 14:06:16 m3pc ollama[220915]: time=2025-05-12T14:06:16.905Z level=DEBUG source=ggml.go:225 msg="created tensor" name=token_embd.weight shape="[3840 262144]" dtype=1 buffer_type=CPU
```

`token_embd.weight` is not a repeating layer, so it will not have much of an impact running in a CPU buffer. The reason for choosing CPU over CUDA is not clear.

<!-- gh-comment-id:2872881876 --> @rick-github commented on GitHub (May 12, 2025): ``` May 12 14:06:16 m3pc ollama[220915]: time=2025-05-12T14:06:16.905Z level=DEBUG source=ggml.go:225 msg="created tensor" name=token_embd.weight shape="[3840 262144]" dtype=1 buffer_type=CPU ``` `token_embd.weight` is not a repeating layer so will not have much of an impact running in a CPU buffer. The reason for choosing CPU over CUDA is not clear.
Author
Owner

@jessegross commented on GitHub (May 12, 2025):

> ```
> May 12 14:06:16 m3pc ollama[220915]: time=2025-05-12T14:06:16.905Z level=DEBUG source=ggml.go:225 msg="created tensor" name=token_embd.weight shape="[3840 262144]" dtype=1 buffer_type=CPU
> ```
>
> `token_embd.weight` is not a repeating layer, so it will not have much of an impact running in a CPU buffer. The reason for choosing CPU over CUDA is not clear.

We always load the input layer on the CPU because the CUDA backend doesn't support some of the required operations on quantized tensors. As you say, the performance impact is minimal.
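
To make that placement rule concrete, here is a small illustrative sketch of choosing a buffer by backend support; the types and op names are hypothetical, not Ollama's actual code:

```go
package main

import "fmt"

// Backend is a toy stand-in for a compute device; the names and fields
// here are hypothetical, not Ollama's actual API.
type Backend struct {
	Name      string
	Supported map[string]bool // op name -> supported for this tensor's type
}

// chooseBuffer mirrors the fallback described above: a tensor is placed on
// the GPU only if every operation it participates in is supported there;
// otherwise it stays in a CPU buffer.
func chooseBuffer(gpu, cpu Backend, ops []string) Backend {
	for _, op := range ops {
		if !gpu.Supported[op] {
			return cpu
		}
	}
	return gpu
}

func main() {
	cuda := Backend{"CUDA0", map[string]bool{"mul_mat": true, "get_rows": false}}
	cpu := Backend{"CPU", map[string]bool{"mul_mat": true, "get_rows": true}}
	// The embedding table needs row lookups on a quantized tensor, which we
	// assume the GPU backend does not support, so it lands in a CPU buffer.
	fmt.Println(chooseBuffer(cuda, cpu, []string{"get_rows"}).Name) // CPU
}
```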

<!-- gh-comment-id:2873422396 --> @jessegross commented on GitHub (May 12, 2025): > ``` > May 12 14:06:16 m3pc ollama[220915]: time=2025-05-12T14:06:16.905Z level=DEBUG source=ggml.go:225 msg="created tensor" name=token_embd.weight shape="[3840 262144]" dtype=1 buffer_type=CPU > ``` > > `token_embd.weight` is not a repeating layer so will not have much of an impact running in a CPU buffer. The reason for choosing CPU over CUDA is not clear. We always load the input layer on the CPU because the CUDA backend doesn't support some of the required operations on quantized tensors. As you say, the performance impact is minimal.