[GH-ISSUE #8850] [QUESTION] Why is gpu not using full power or mid to 80% while processing requests ? #31497

Closed
opened 2026-04-22 11:57:30 -05:00 by GiteaMirror · 7 comments

Originally created by @Greatz08 on GitHub (Feb 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8850

I don't know why, but I am consistently seeing that the full model is loaded on the GPU, yet it is not using full GPU power to process things faster. Is there something I am missing? If yes, please guide me on how to fix it. I am not planning to force my NVIDIA RTX 4060 GPU with nvidia-smi commands, because in my opinion that won't be an optimized setting.


@rick-github commented on GitHub (Feb 5, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.


@ALLMI78 commented on GitHub (Feb 5, 2025):

I'm just a newbie, so if Rick says something, it's better to follow his instructions because he's a pro here. But I have 1-2 tips:

1. Pay attention to the following lines in the log (right-click on the icon and use "view logs"):

   ```
   llm_load_tensors: offloading 48 repeating layers to GPU
   llm_load_tensors: offloading output layer to GPU
   llm_load_tensors: offloaded 49/49 layers to GPU
   ```

2. Check GPU load with `ollama ps` while the model is running (see the sketch below). The values there differ from what you see in the Windows Task Manager.

3. The parameter `num_gpu` controls this. I set it to 100 for my 4060, but I also monitor it myself to see if that works.

4. If you want optimal performance, to my knowledge all layers should be loaded to the GPU (`offloaded 49/49 layers to GPU`). If your model is too large, try lowering the quantization by 1-2 steps until it fits.

Hope this helps. Otherwise, you'll have to wait until Rick has more time, but he'll need your log files then.
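
A minimal sketch of the same check from a script, assuming a recent Ollama build that exposes the `/api/ps` endpoint (the data behind `ollama ps`) on the default port 11434; it reports how much of each loaded model is resident in VRAM:

```python
import json
import urllib.request

# Query the running Ollama server for loaded models (same data as `ollama ps`).
with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    models = json.load(resp).get("models", [])

for m in models:
    size = m.get("size", 0)            # total bytes the model occupies
    size_vram = m.get("size_vram", 0)  # bytes resident in GPU memory
    pct = 100 * size_vram / size if size else 0
    print(f"{m['name']}: {size_vram / 2**30:.1f} GiB of "
          f"{size / 2**30:.1f} GiB in VRAM ({pct:.0f}%)")
```

A model showing well under 100% here is the usual explanation for low GPU utilization during generation.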


@rick-github commented on GitHub (Feb 5, 2025):

> If you want optimal performance, to my knowledge all layers should be loaded to the GPU (`offloaded 49/49 layers to GPU`). If your model is too large, try lowering the quantization by 1-2 steps until it fits.

Maximizing the number of layers on the GPU gets better performance. Because of the vagaries of model architecture, ollama sometimes underestimates how much of the model can be offloaded, and so doesn't use as much VRAM as it could. In those cases, overriding by manually setting `num_gpu` can improve performance. However, there are two pitfalls here. First, if your GPU doesn't support shared memory, there is a risk of over-allocating VRAM and the runner crashing with an Out Of Memory (OOM) failure. Second, if your GPU does support shared memory and it's enabled (the default on Windows; Linux users need to set `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1`), loading too many layers onto the GPU will cause some layers to be allocated in system RAM and there will be a significant [performance penalty](https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900).
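
As a hedged illustration of that override (not a recommendation for specific values), `num_gpu` can be passed per request through the API `options` field; the model name and layer count below are placeholders:

```python
import json
import urllib.request

# Ask the server to offload more layers than its own estimate. Pick the value
# carefully: over-allocating VRAM can crash the runner with an OOM error.
payload = {
    "model": "qwen2.5-coder:32b",   # placeholder model name
    "prompt": "Hello",
    "stream": False,
    "options": {"num_gpu": 65},     # number of layers to offload to the GPU
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"][:200])
```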


@ALLMI78 commented on GitHub (Feb 5, 2025):

Thanks Rick, can you explain how `use_mmap` and `use_mlock` influence this?

My current settings are:

```
use_mmap = false
use_mlock = true
```

I alternate between 2 models, and with the current settings, it works without the two models being repeatedly loaded from the SSD.

I don't know how it works or why, because my models can't both fit into my 16 GB of VRAM at the same time, but this way it works...

If I change something (`use_mmap = true`), both still run, but they are then continuously loaded from the SSD again.


@rick-github commented on GitHub (Feb 7, 2025):

[`mmap`](https://man7.org/linux/man-pages/man2/mmap.2.html) and [`mlock`](https://man7.org/linux/man-pages/man2/mlock.2.html) are operating system features that allow a program to have finer control over how memory is managed. These features only apply to system RAM; if the model is fully loaded in VRAM, they have no effect.

`use_mmap` causes the model to be mapped into the virtual address space of the process, rather than the process having to use physical RAM to hold the model weights. In theory this leaves more RAM available for things like the context buffer and other memory allocations, while the model weights are read from disk as required. In practice, reading model weights results in them being stored in the page cache, potentially causing swapping and page thrashing as the CPU processes the weights into the context buffer. However, because it's a read-only operation, mmapping a large model is usually more efficient than loading it into swap.

`use_mlock` locks memory pages into physical memory so that they don't get paged out when the operating system decides it needs to make space for something else. Because the pages are always resident, inference is faster, since the model weights don't need to be paged in from swap.

In your case, if you have `"use_mmap":false` and `"use_mlock":true`, then the model weights are being loaded into system RAM and locked in place to prevent swapping. When you set `"use_mmap":true`, the model is being read from the SSD rather than being loaded into system RAM.
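
For reference, a rough sketch of passing those same flags per request through the API `options` field, assuming a local server on the default port; the model name is a placeholder:

```python
import json
import urllib.request

# Disable mmap and lock the weights in RAM for this request, mirroring the
# settings discussed above; whether this helps depends on free system RAM.
payload = {
    "model": "qwen2.5-coder:32b",   # placeholder model name
    "prompt": "Hello",
    "stream": False,
    "options": {"use_mmap": False, "use_mlock": True},
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["done"])
```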


@Greatz08 commented on GitHub (Feb 7, 2025):

@rick-github
Here are the logs of a 32B model which I tried to run under 8 GB of VRAM. I used the most heavily quantized version, which was 10 GB in size.

```
-- Boot 9db74ba14523486dbc56ce8d7cd65f48 --
Feb 07 19:24:34 AIbo systemd[1]: Started Ollama Service.
Feb 07 19:24:34 AIbo ollama[3046]: 2025/02/07 19:24:34 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/var/lib/ollama OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Feb 07 19:24:34 AIbo ollama[3046]: time=2025-02-07T19:24:34.744+05:30 level=INFO source=images.go:753 msg="total blobs: 14"
Feb 07 19:24:34 AIbo ollama[3046]: time=2025-02-07T19:24:34.746+05:30 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
Feb 07 19:24:34 AIbo ollama[3046]: time=2025-02-07T19:24:34.746+05:30 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.12)"
Feb 07 19:24:34 AIbo ollama[3046]: time=2025-02-07T19:24:34.746+05:30 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama1116644103/runners
Feb 07 19:24:37 AIbo ollama[3046]: time=2025-02-07T19:24:37.252+05:30 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v12]"
Feb 07 19:24:37 AIbo ollama[3046]: time=2025-02-07T19:24:37.252+05:30 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
Feb 07 19:24:37 AIbo ollama[3046]: time=2025-02-07T19:24:37.252+05:30 level=WARN source=gpu.go:669 msg="unable to locate gpu dependency libraries"
Feb 07 19:24:37 AIbo ollama[3046]: time=2025-02-07T19:24:37.252+05:30 level=WARN source=gpu.go:669 msg="unable to locate gpu dependency libraries"
Feb 07 19:24:37 AIbo ollama[3046]: time=2025-02-07T19:24:37.252+05:30 level=WARN source=gpu.go:669 msg="unable to locate gpu dependency libraries"
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.257+05:30 level=WARN source=amd_linux.go:60 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.260+05:30 level=WARN source=amd_linux.go:400 msg="amdgpu detected, but no compatible rocm library found. Either install rocm v6, or follow manual install instructions at https://github.com/ollama/ollama/blob/main/docs/linux.md#manual-install"
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.260+05:30 level=WARN source=amd_linux.go:323 msg="unable to verify rocm library, will use cpu" error="no suitable rocm found, falling back to CPU"
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.260+05:30 level=INFO source=types.go:107 msg="inference compute" id=GPU-dfc809ba-80dc-48e0-0673-4e7c6101a035 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4060 Laptop GPU" total="7.6 GiB" available="7.6 GiB"
Feb 07 19:24:39 AIbo ollama[3046]: [GIN] 2025/02/07 - 19:24:39 | 200 | 26.159µs | 127.0.0.1 | HEAD "/"
Feb 07 19:24:39 AIbo ollama[3046]: [GIN] 2025/02/07 - 19:24:39 | 200 | 16.235891ms | 127.0.0.1 | POST "/api/show"
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.411+05:30 level=INFO source=server.go:103 msg="system memory" total="14.3 GiB" free="9.2 GiB" free_swap="0 B"
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.411+05:30 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=65 layers.offload=42 layers.split="" memory.available="[7.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.0 GiB" memory.required.partial="7.4 GiB" memory.required.kv="512.0 MiB" memory.required.allocations="[7.4 GiB]" memory.weights.total="9.0 GiB" memory.weights.repeating="8.5 GiB" memory.weights.nonrepeating="510.5 MiB" memory.graph.full="307.0 MiB" memory.graph.partial="916.1 MiB"
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.413+05:30 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama1116644103/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/blobs/sha256-1263bcbbb4d5496a3cb3a5922f95dd886a971303fbb2044e8f644024a08cd629 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 42 --threads 4 --no-mmap --parallel 1 --port 45853"
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.413+05:30 level=INFO source=sched.go:449 msg="loaded runners" count=1
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.413+05:30 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.413+05:30 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
Feb 07 19:24:39 AIbo ollama[3253]: INFO [main] build info | build=3670 commit="5ef211c5d" tid="140457483812864" timestamp=1738936479
Feb 07 19:24:39 AIbo ollama[3253]: INFO [main] system info | n_threads=4 n_threads_batch=4 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140457483812864" timestamp=1738936479 total_threads=16
Feb 07 19:24:39 AIbo ollama[3253]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="45853" tid="140457483812864" timestamp=1738936479
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /var/lib/ollama/blobs/sha256-1263bcbbb4d5496a3cb3a5922f95dd886a971303fbb2044e8f644024a08cd629 (version GGUF V3 (latest))
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 0: general.architecture str = qwen2
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 1: general.type str = model
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 2: general.name str = FuseO1 DeepSeekR1 Qwen2.5 Coder 32B P...
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 3: general.version str = v0.1
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 4: general.finetune str = Preview
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 5: general.basename str = FuseO1-DeepSeekR1-Qwen2.5-Coder
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 6: general.size_label str = 32B
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 7: general.license str = apache-2.0
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 8: qwen2.block_count u32 = 64
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 9: qwen2.context_length u32 = 131072
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 10: qwen2.embedding_length u32 = 5120
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 11: qwen2.feed_forward_length u32 = 27648
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 12: qwen2.attention.head_count u32 = 40
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 13: qwen2.attention.head_count_kv u32 = 8
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 14: qwen2.rope.freq_base f32 = 1000000.000000
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 15: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 16: tokenizer.ggml.model str = gpt2
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 17: tokenizer.ggml.pre str = qwen2
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 20: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 21: tokenizer.ggml.bos_token_id u32 = 151646
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 151643
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 151643
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 26: tokenizer.chat_template str = {% if not add_generation_prompt is de...
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 27: general.quantization_version u32 = 2
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 28: general.file_type u32 = 20
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 29: quantize.imatrix.file str = /models_out/FuseO1-DeepSeekR1-Qwen2.5...
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 30: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 448
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 128
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - type f32: 321 tensors
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - type q2_K: 9 tensors
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - type q4_K: 64 tensors
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - type q5_K: 1 tensors
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - type iq2_xs: 376 tensors
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_vocab: special tokens cache size = 22
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_vocab: token to piece cache size = 0.9310 MB
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: format = GGUF V3 (latest)
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: arch = qwen2
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: vocab type = BPE
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_vocab = 152064
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_merges = 151387
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: vocab_only = 0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_ctx_train = 131072
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_embd = 5120
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_layer = 64
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_head = 40
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_head_kv = 8
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_rot = 128
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_swa = 0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_embd_head_k = 128
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_embd_head_v = 128
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_gqa = 5
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_embd_k_gqa = 1024
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_embd_v_gqa = 1024
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: f_norm_eps = 0.0e+00
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: f_logit_scale = 0.0e+00
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_ff = 27648
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_expert = 0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_expert_used = 0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: causal attn = 1
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: pooling type = 0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: rope type = 2
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: rope scaling = linear
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: freq_base_train = 1000000.0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: freq_scale_train = 1
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_ctx_orig_yarn = 131072
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: rope_finetuned = unknown
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: ssm_d_conv = 0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: ssm_d_inner = 0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: ssm_d_state = 0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: ssm_dt_rank = 0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: ssm_dt_b_c_rms = 0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: model type = ?B
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: model ftype = IQ2_XS - 2.3125 bpw
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: model params = 32.76 B
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: model size = 9.27 GiB (2.43 BPW)
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: general.name = FuseO1 DeepSeekR1 Qwen2.5 Coder 32B Preview v0.1
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: BOS token = 151646 '<|begin▁of▁sentence|>'
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: EOS token = 151643 '<|end▁of▁sentence|>'
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: PAD token = 151643 '<|end▁of▁sentence|>'
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: LF token = 148848 'ÄĬ'
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: max token length = 256
Feb 07 19:24:39 AIbo ollama[3046]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Feb 07 19:24:39 AIbo ollama[3046]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Feb 07 19:24:39 AIbo ollama[3046]: ggml_cuda_init: found 1 CUDA devices:
Feb 07 19:24:39 AIbo ollama[3046]: Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.664+05:30 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_tensors: ggml ctx size = 0.68 MiB
Feb 07 19:24:40 AIbo ollama[3046]: ggml_cuda_host_malloc: failed to allocate 3784.96 MiB of pinned memory: invalid argument
Feb 07 19:24:40 AIbo ollama[3046]: llm_load_tensors: offloading 42 repeating layers to GPU
Feb 07 19:24:40 AIbo ollama[3046]: llm_load_tensors: offloaded 42/65 layers to GPU
Feb 07 19:24:40 AIbo ollama[3046]: llm_load_tensors: CPU buffer size = 3784.96 MiB
Feb 07 19:24:40 AIbo ollama[3046]: llm_load_tensors: CUDA0 buffer size = 5705.60 MiB
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: n_ctx = 2048
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: n_batch = 512
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: n_ubatch = 512
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: flash_attn = 0
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: freq_base = 1000000.0
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: freq_scale = 1
Feb 07 19:24:43 AIbo ollama[3046]: llama_kv_cache_init: CUDA_Host KV buffer size = 176.00 MiB
Feb 07 19:24:43 AIbo ollama[3046]: llama_kv_cache_init: CUDA0 KV buffer size = 336.00 MiB
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: CUDA_Host output buffer size = 0.60 MiB
Feb 07 19:24:43 AIbo ollama[3253]: [1738936483] warming up the model with an empty run
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: CUDA0 compute buffer size = 817.47 MiB
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: CUDA_Host compute buffer size = 14.01 MiB
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: graph nodes = 2246
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: graph splits = 312
Feb 07 19:24:44 AIbo ollama[3253]: INFO [main] model loaded | tid="140457483812864" timestamp=1738936484
Feb 07 19:24:44 AIbo ollama[3046]: time=2025-02-07T19:24:44.428+05:30 level=INFO source=server.go:626 msg="llama runner started in 5.02 seconds"
Feb 07 19:24:44 AIbo ollama[3046]: [GIN] 2025/02/07 - 19:24:44 | 200 | 5.159778121s | 127.0.0.1 | POST "/api/generate"
Feb 07 19:24:56 AIbo ollama[3046]: [GIN] 2025/02/07 - 19:24:56 | 200 | 10.472921533s | 127.0.0.1 | POST "/api/chat"
Feb 07 19:25:10 AIbo ollama[3046]: [GIN] 2025/02/07 - 19:25:10 | 200 | 31.813µs | 127.0.0.1 | HEAD "/"
Feb 07 19:25:10 AIbo ollama[3046]: [GIN] 2025/02/07 - 19:25:10 | 200 | 10.813277ms | 127.0.0.1 | POST "/api/show"
Feb 07 19:25:10 AIbo ollama[3046]: [GIN] 2025/02/07 - 19:25:10 | 200 | 10.533423ms | 127.0.0.1 | POST "/api/generate"
Feb 07 19:26:59 AIbo ollama[3046]: [GIN] 2025/02/07 - 19:26:59 | 200 | 1m42s | 127.0.0.1 | POST "/api/chat"
```

I know that it will auto-allocate the remaining model weights to RAM, and it did, as I could see from nvtop. BUT when I was observing generation, the GPU usage graph was consistently shifting between 10% and 30%, never exceeding 50%, although I guess it should when the maximum amount is loaded in VRAM. I thought I could get better than 3.36 t/s if it had used more GPU power.

So what parameters can or should I set to get the maximum possible performance in this type of situation, where I can't fully load the model but most of the model weights are in VRAM, and my RAM is also decent (DDR5 5600)?

You also mentioned the mlock thing, so how can I set that, and will it help in getting better performance in my scenario?


@rick-github commented on GitHub (Feb 7, 2025):

```
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.411+05:30 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=65 layers.offload=42 layers.split="" memory.available="[7.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.0 GiB" memory.required.partial="7.4 GiB" memory.required.kv="512.0 MiB" memory.required.allocations="[7.4 GiB]" memory.weights.total="9.0 GiB" memory.weights.repeating="8.5 GiB" memory.weights.nonrepeating="510.5 MiB" memory.graph.full="307.0 MiB" memory.graph.partial="916.1 MiB"
```

The model doesn't fit in VRAM; 23 of the layers run in system RAM, where the CPU does the inference. Inference happens per layer, so the first 42 layers are processed very quickly by the GPU while the CPU waits. Then the CPU processes its 23 layers more slowly while the GPU waits. This is why the average utilization of the GPU/CPU is low: part of the time each is waiting for the other processor.

The only way to get a better token generation rate is to fit the model in VRAM. That means more VRAM, a more quantized model, or a different model. If only a small amount of the model were in system RAM, there's a [trick](https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900) to giving the GPU full access, but that won't work in this case.

ollama is loading 7.4G in VRAM and 3.6G in system RAM. You have 9.2G of free RAM, so unless you are running some other big processes, there will be no paging of model weights to swap and `use_mlock` won't make a difference.
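
If you want to compare settings objectively, here is a small sketch that computes the generation rate from the timing fields a non-streaming `/api/generate` response includes (`eval_count` tokens over `eval_duration` nanoseconds); the model name is a placeholder:

```python
import json
import urllib.request

payload = {
    "model": "qwen2.5-coder:32b",   # placeholder model name
    "prompt": "Write a haiku about GPUs.",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    stats = json.load(resp)

# eval_count / eval_duration gives tokens per nanosecond; scale to tokens/sec.
tps = stats["eval_count"] / stats["eval_duration"] * 1e9
print(f"generation rate: {tps:.2f} tokens/s")
```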


Reference: github-starred/ollama#31497