[GH-ISSUE #8850] [QUESTION] Why is gpu not using full power or mid to 80% while processing requests ? #31497

Closed
opened 2026-04-22 11:57:30 -05:00 by GiteaMirror · 7 comments

Originally created by @Greatz08 on GitHub (Feb 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8850

I don't know why, but I am consistently seeing that the full model is loaded on the GPU, yet it is not using full GPU power to process things faster. Is there something I am missing? If yes, please guide me on how to fix it. I am not planning to force my NVIDIA RTX 4060 GPU with nvidia-smi commands, because in my opinion that won't be an optimized setting.


@rick-github commented on GitHub (Feb 5, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.


@ALLMI78 commented on GitHub (Feb 5, 2025):

I'm just a newbie, so if Rick says something, it's better to follow his instructions because he's a pro here. But I have 1-2 tips:

1. Pay attention to the following lines in the log (right-click on the icon and use "view logs"):

   ```
   llm_load_tensors: offloading 48 repeating layers to GPU
   llm_load_tensors: offloading output layer to GPU
   llm_load_tensors: offloaded 49/49 layers to GPU
   ```

2. Check GPU load with `ollama ps` while the model is running (see the sketch below). The values there differ from what you see in the Windows Task Manager.

3. The parameter `num_gpu` controls this. I set it to 100 for my 4060, but I also monitor it myself to see if that works.

4. If you want optimal performance, to my knowledge all layers should be loaded to the GPU (`offloaded 49/49 layers to GPU`). If your model is too large, try lowering the quantization by 1-2 steps until it fits.

Hope this helps. Otherwise, you'll have to wait until Rick has more time, but he'll need your log files then.
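
A minimal sketch of the same check from a script, assuming a recent Ollama build that exposes the `/api/ps` endpoint (the data behind `ollama ps`) on the default port 11434; it reports how much of each loaded model is resident in VRAM:

```python
import json
import urllib.request

# Query the running Ollama server for loaded models (same data as `ollama ps`).
with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    models = json.load(resp).get("models", [])

for m in models:
    size = m.get("size", 0)            # total bytes the model occupies
    size_vram = m.get("size_vram", 0)  # bytes resident in GPU memory
    pct = 100 * size_vram / size if size else 0
    print(f"{m['name']}: {size_vram / 2**30:.1f} GiB of "
          f"{size / 2**30:.1f} GiB in VRAM ({pct:.0f}%)")
```

A model showing well under 100% here is the usual explanation for low GPU utilization during generation.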


@rick-github commented on GitHub (Feb 5, 2025):

> If you want optimal performance, to my knowledge all layers should be loaded to the GPU (`offloaded 49/49 layers to GPU`). If your model is too large, try lowering the quantization by 1-2 steps until it fits.

Maximizing the number of layers on the GPU gets better performance. Because of the vagaries of model architecture, ollama sometimes underestimates how much of the model can be offloaded, and so doesn't use as much VRAM as it could. In those cases, overriding by manually setting `num_gpu` can improve performance. However, there are two pitfalls here. First, if your GPU doesn't support shared memory, there is a risk of over-allocating VRAM and the runner crashing with an Out Of Memory (OOM) failure. Second, if your GPU does support shared memory and it's enabled (the default on Windows; Linux users need to set `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1`), loading too many layers onto the GPU will cause some layers to be allocated in system RAM and there will be a significant [performance penalty](https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900).
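
As a hedged illustration of that override (not a recommendation for specific values), `num_gpu` can be passed per request through the API `options` field; the model name and layer count below are placeholders:

```python
import json
import urllib.request

# Ask the server to offload more layers than its own estimate. Pick the value
# carefully: over-allocating VRAM can crash the runner with an OOM error.
payload = {
    "model": "qwen2.5-coder:32b",   # placeholder model name
    "prompt": "Hello",
    "stream": False,
    "options": {"num_gpu": 65},     # number of layers to offload to the GPU
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"][:200])
```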


@ALLMI78 commented on GitHub (Feb 5, 2025):

Thanks Rick, can you explain how `use_mmap` and `use_mlock` influence this?

My current settings are:

```
use_mmap = false
use_mlock = true
```

I alternate between 2 models, and with the current settings, it works without the two models being repeatedly loaded from the SSD.

I don't know how it works or why, because my models can't both fit into my 16 GB of VRAM at the same time, but this way it works...

If I change something (`use_mmap = true`), both still run, but they are then continuously loaded from the SSD again.


@rick-github commented on GitHub (Feb 7, 2025):

[`mmap`](https://man7.org/linux/man-pages/man2/mmap.2.html) and [`mlock`](https://man7.org/linux/man-pages/man2/mlock.2.html) are operating system features that allow a program to have finer control over how memory is managed. These features only apply to system RAM; if the model is fully loaded in VRAM, they have no effect.

`use_mmap` causes the model to be mapped into the virtual address space of the process, rather than the process having to use physical RAM to hold the model weights. In theory this leaves more RAM available for things like the context buffer and other memory allocations, while the model weights are read from disk as required. In practice, reading model weights results in them being stored in the page cache, potentially causing swapping and page thrashing as the CPU processes the weights into the context buffer. However, because it's a read-only operation, mmapping a large model is usually more efficient than loading it into swap.

`use_mlock` locks memory pages into physical memory so that they don't get paged out when the operating system decides it needs to make space for something else. Because the pages are always resident, inference is faster, since the model weights don't need to be paged in from swap.

In your case, if you have `"use_mmap":false` and `"use_mlock":true`, then the model weights are being loaded into system RAM and locked in place to prevent swapping. When you set `"use_mmap":true`, the model is being read from the SSD rather than being loaded into system RAM.
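
For reference, a rough sketch of passing those same flags per request through the API `options` field, assuming a local server on the default port; the model name is a placeholder:

```python
import json
import urllib.request

# Disable mmap and lock the weights in RAM for this request, mirroring the
# settings discussed above; whether this helps depends on free system RAM.
payload = {
    "model": "qwen2.5-coder:32b",   # placeholder model name
    "prompt": "Hello",
    "stream": False,
    "options": {"use_mmap": False, "use_mlock": True},
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["done"])
```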


@Greatz08 commented on GitHub (Feb 7, 2025):

@rick-github
Here are the logs of a 32B model which I tried to run under 8 GB of VRAM. I used the most heavily quantized version, which was 10 GB in size.

```
-- Boot 9db74ba14523486dbc56ce8d7cd65f48 --
Feb 07 19:24:34 AIbo systemd[1]: Started Ollama Service.
Feb 07 19:24:34 AIbo ollama[3046]: 2025/02/07 19:24:34 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/var/lib/ollama OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Feb 07 19:24:34 AIbo ollama[3046]: time=2025-02-07T19:24:34.744+05:30 level=INFO source=images.go:753 msg="total blobs: 14"
Feb 07 19:24:34 AIbo ollama[3046]: time=2025-02-07T19:24:34.746+05:30 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
Feb 07 19:24:34 AIbo ollama[3046]: time=2025-02-07T19:24:34.746+05:30 level=INFO source=routes.go:1200 msg="Listening on [::]:11434 (version 0.3.12)"
Feb 07 19:24:34 AIbo ollama[3046]: time=2025-02-07T19:24:34.746+05:30 level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama1116644103/runners
Feb 07 19:24:37 AIbo ollama[3046]: time=2025-02-07T19:24:37.252+05:30 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v12]"
Feb 07 19:24:37 AIbo ollama[3046]: time=2025-02-07T19:24:37.252+05:30 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
Feb 07 19:24:37 AIbo ollama[3046]: time=2025-02-07T19:24:37.252+05:30 level=WARN source=gpu.go:669 msg="unable to locate gpu dependency libraries"
Feb 07 19:24:37 AIbo ollama[3046]: time=2025-02-07T19:24:37.252+05:30 level=WARN source=gpu.go:669 msg="unable to locate gpu dependency libraries"
Feb 07 19:24:37 AIbo ollama[3046]: time=2025-02-07T19:24:37.252+05:30 level=WARN source=gpu.go:669 msg="unable to locate gpu dependency libraries"
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.257+05:30 level=WARN source=amd_linux.go:60 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.260+05:30 level=WARN source=amd_linux.go:400 msg="amdgpu detected, but no compatible rocm library found. Either install rocm v6, or follow manual install instructions at https://github.com/ollama/ollama/blob/main/docs/linux.md#manual-install"
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.260+05:30 level=WARN source=amd_linux.go:323 msg="unable to verify rocm library, will use cpu" error="no suitable rocm found, falling back to CPU"
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.260+05:30 level=INFO source=types.go:107 msg="inference compute" id=GPU-dfc809ba-80dc-48e0-0673-4e7c6101a035 library=cuda variant=v12 compute=8.9 driver=12.7 name="NVIDIA GeForce RTX 4060 Laptop GPU" total="7.6 GiB" available="7.6 GiB"
Feb 07 19:24:39 AIbo ollama[3046]: [GIN] 2025/02/07 - 19:24:39 | 200 | 26.159µs | 127.0.0.1 | HEAD "/"
Feb 07 19:24:39 AIbo ollama[3046]: [GIN] 2025/02/07 - 19:24:39 | 200 | 16.235891ms | 127.0.0.1 | POST "/api/show"
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.411+05:30 level=INFO source=server.go:103 msg="system memory" total="14.3 GiB" free="9.2 GiB" free_swap="0 B"
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.411+05:30 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=65 layers.offload=42 layers.split="" memory.available="[7.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.0 GiB" memory.required.partial="7.4 GiB" memory.required.kv="512.0 MiB" memory.required.allocations="[7.4 GiB]" memory.weights.total="9.0 GiB" memory.weights.repeating="8.5 GiB" memory.weights.nonrepeating="510.5 MiB" memory.graph.full="307.0 MiB" memory.graph.partial="916.1 MiB"
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.413+05:30 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama1116644103/runners/cuda_v12/ollama_llama_server --model /var/lib/ollama/blobs/sha256-1263bcbbb4d5496a3cb3a5922f95dd886a971303fbb2044e8f644024a08cd629 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 42 --threads 4 --no-mmap --parallel 1 --port 45853"
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.413+05:30 level=INFO source=sched.go:449 msg="loaded runners" count=1
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.413+05:30 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.413+05:30 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
Feb 07 19:24:39 AIbo ollama[3253]: INFO [main] build info | build=3670 commit="5ef211c5d" tid="140457483812864" timestamp=1738936479
Feb 07 19:24:39 AIbo ollama[3253]: INFO [main] system info | n_threads=4 n_threads_batch=4 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140457483812864" timestamp=1738936479 total_threads=16
Feb 07 19:24:39 AIbo ollama[3253]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="45853" tid="140457483812864" timestamp=1738936479
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /var/lib/ollama/blobs/sha256-1263bcbbb4d5496a3cb3a5922f95dd886a971303fbb2044e8f644024a08cd629 (version GGUF V3 (latest))
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 0: general.architecture str = qwen2
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 1: general.type str = model
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 2: general.name str = FuseO1 DeepSeekR1 Qwen2.5 Coder 32B P...
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 3: general.version str = v0.1
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 4: general.finetune str = Preview
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 5: general.basename str = FuseO1-DeepSeekR1-Qwen2.5-Coder
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 6: general.size_label str = 32B
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 7: general.license str = apache-2.0
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 8: qwen2.block_count u32 = 64
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 9: qwen2.context_length u32 = 131072
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 10: qwen2.embedding_length u32 = 5120
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 11: qwen2.feed_forward_length u32 = 27648
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 12: qwen2.attention.head_count u32 = 40
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 13: qwen2.attention.head_count_kv u32 = 8
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 14: qwen2.rope.freq_base f32 = 1000000.000000
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 15: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 16: tokenizer.ggml.model str = gpt2
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 17: tokenizer.ggml.pre str = qwen2
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 20: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 21: tokenizer.ggml.bos_token_id u32 = 151646
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 151643
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 151643
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 26: tokenizer.chat_template str = {% if not add_generation_prompt is de...
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 27: general.quantization_version u32 = 2
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 28: general.file_type u32 = 20
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 29: quantize.imatrix.file str = /models_out/FuseO1-DeepSeekR1-Qwen2.5...
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 30: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 448
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 128
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - type f32: 321 tensors
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - type q2_K: 9 tensors
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - type q4_K: 64 tensors
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - type q5_K: 1 tensors
Feb 07 19:24:39 AIbo ollama[3046]: llama_model_loader: - type iq2_xs: 376 tensors
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_vocab: special tokens cache size = 22
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_vocab: token to piece cache size = 0.9310 MB
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: format = GGUF V3 (latest)
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: arch = qwen2
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: vocab type = BPE
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_vocab = 152064
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_merges = 151387
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: vocab_only = 0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_ctx_train = 131072
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_embd = 5120
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_layer = 64
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_head = 40
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_head_kv = 8
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_rot = 128
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_swa = 0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_embd_head_k = 128
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_embd_head_v = 128
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_gqa = 5
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_embd_k_gqa = 1024
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_embd_v_gqa = 1024
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: f_norm_eps = 0.0e+00
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: f_logit_scale = 0.0e+00
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_ff = 27648
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_expert = 0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_expert_used = 0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: causal attn = 1
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: pooling type = 0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: rope type = 2
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: rope scaling = linear
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: freq_base_train = 1000000.0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: freq_scale_train = 1
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: n_ctx_orig_yarn = 131072
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: rope_finetuned = unknown
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: ssm_d_conv = 0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: ssm_d_inner = 0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: ssm_d_state = 0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: ssm_dt_rank = 0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: ssm_dt_b_c_rms = 0
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: model type = ?B
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: model ftype = IQ2_XS - 2.3125 bpw
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: model params = 32.76 B
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: model size = 9.27 GiB (2.43 BPW)
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: general.name = FuseO1 DeepSeekR1 Qwen2.5 Coder 32B Preview v0.1
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: BOS token = 151646 '<|begin▁of▁sentence|>'
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: EOS token = 151643 '<|end▁of▁sentence|>'
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: PAD token = 151643 '<|end▁of▁sentence|>'
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: LF token = 148848 'ÄĬ'
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_print_meta: max token length = 256
Feb 07 19:24:39 AIbo ollama[3046]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Feb 07 19:24:39 AIbo ollama[3046]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Feb 07 19:24:39 AIbo ollama[3046]: ggml_cuda_init: found 1 CUDA devices:
Feb 07 19:24:39 AIbo ollama[3046]: Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.664+05:30 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
Feb 07 19:24:39 AIbo ollama[3046]: llm_load_tensors: ggml ctx size = 0.68 MiB
Feb 07 19:24:40 AIbo ollama[3046]: ggml_cuda_host_malloc: failed to allocate 3784.96 MiB of pinned memory: invalid argument
Feb 07 19:24:40 AIbo ollama[3046]: llm_load_tensors: offloading 42 repeating layers to GPU
Feb 07 19:24:40 AIbo ollama[3046]: llm_load_tensors: offloaded 42/65 layers to GPU
Feb 07 19:24:40 AIbo ollama[3046]: llm_load_tensors: CPU buffer size = 3784.96 MiB
Feb 07 19:24:40 AIbo ollama[3046]: llm_load_tensors: CUDA0 buffer size = 5705.60 MiB
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: n_ctx = 2048
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: n_batch = 512
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: n_ubatch = 512
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: flash_attn = 0
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: freq_base = 1000000.0
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: freq_scale = 1
Feb 07 19:24:43 AIbo ollama[3046]: llama_kv_cache_init: CUDA_Host KV buffer size = 176.00 MiB
Feb 07 19:24:43 AIbo ollama[3046]: llama_kv_cache_init: CUDA0 KV buffer size = 336.00 MiB
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: CUDA_Host output buffer size = 0.60 MiB
Feb 07 19:24:43 AIbo ollama[3253]: [1738936483] warming up the model with an empty run
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: CUDA0 compute buffer size = 817.47 MiB
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: CUDA_Host compute buffer size = 14.01 MiB
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: graph nodes = 2246
Feb 07 19:24:43 AIbo ollama[3046]: llama_new_context_with_model: graph splits = 312
Feb 07 19:24:44 AIbo ollama[3253]: INFO [main] model loaded | tid="140457483812864" timestamp=1738936484
Feb 07 19:24:44 AIbo ollama[3046]: time=2025-02-07T19:24:44.428+05:30 level=INFO source=server.go:626 msg="llama runner started in 5.02 seconds"
Feb 07 19:24:44 AIbo ollama[3046]: [GIN] 2025/02/07 - 19:24:44 | 200 | 5.159778121s | 127.0.0.1 | POST "/api/generate"
Feb 07 19:24:56 AIbo ollama[3046]: [GIN] 2025/02/07 - 19:24:56 | 200 | 10.472921533s | 127.0.0.1 | POST "/api/chat"
Feb 07 19:25:10 AIbo ollama[3046]: [GIN] 2025/02/07 - 19:25:10 | 200 | 31.813µs | 127.0.0.1 | HEAD "/"
Feb 07 19:25:10 AIbo ollama[3046]: [GIN] 2025/02/07 - 19:25:10 | 200 | 10.813277ms | 127.0.0.1 | POST "/api/show"
Feb 07 19:25:10 AIbo ollama[3046]: [GIN] 2025/02/07 - 19:25:10 | 200 | 10.533423ms | 127.0.0.1 | POST "/api/generate"
Feb 07 19:26:59 AIbo ollama[3046]: [GIN] 2025/02/07 - 19:26:59 | 200 | 1m42s | 127.0.0.1 | POST "/api/chat"
```

I know that it will auto-allocate the remaining model weights to RAM, and it did, as I could see from nvtop. BUT when I was observing generation, the GPU usage graph was consistently shifting between 10% and 30%, never exceeding 50%, although I guess it should when the maximum amount is loaded in VRAM. I thought I could get better than 3.36 t/s if it had used more GPU power.

So what parameters can or should I set to get the maximum possible performance in this type of situation, where I can't fully load the model but most of the model weights are in VRAM, and my RAM is also decent (DDR5 5600)?

You also mentioned the mlock thing, so how can I set that, and will it help in getting better performance in my scenario?


@rick-github commented on GitHub (Feb 7, 2025):

```
Feb 07 19:24:39 AIbo ollama[3046]: time=2025-02-07T19:24:39.411+05:30 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=65 layers.offload=42 layers.split="" memory.available="[7.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.0 GiB" memory.required.partial="7.4 GiB" memory.required.kv="512.0 MiB" memory.required.allocations="[7.4 GiB]" memory.weights.total="9.0 GiB" memory.weights.repeating="8.5 GiB" memory.weights.nonrepeating="510.5 MiB" memory.graph.full="307.0 MiB" memory.graph.partial="916.1 MiB"
```

The model doesn't fit in VRAM; 23 of the layers run in system RAM, where the CPU does the inference. Inference happens per layer, so the first 42 layers are processed very quickly by the GPU while the CPU waits. Then the CPU processes its 23 layers more slowly while the GPU waits. This is why the average utilization of the GPU/CPU is low: part of the time each is waiting for the other processor.

The only way to get a better token generation rate is to fit the model in VRAM. That means more VRAM, a more quantized model, or a different model. If only a small amount of the model were in system RAM, there's a [trick](https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900) to giving the GPU full access, but that won't work in this case.

ollama is loading 7.4G in VRAM and 3.6G in system RAM. You have 9.2G of free RAM, so unless you are running some other big processes, there will be no paging of model weights to swap and `use_mlock` won't make a difference.
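
If you want to compare settings objectively, here is a small sketch that computes the generation rate from the timing fields a non-streaming `/api/generate` response includes (`eval_count` tokens over `eval_duration` nanoseconds); the model name is a placeholder:

```python
import json
import urllib.request

payload = {
    "model": "qwen2.5-coder:32b",   # placeholder model name
    "prompt": "Write a haiku about GPUs.",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    stats = json.load(resp)

# eval_count / eval_duration gives tokens per nanosecond; scale to tokens/sec.
tps = stats["eval_count"] / stats["eval_duration"] * 1e9
print(f"generation rate: {tps:.2f} tokens/s")
```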


Reference: github-starred/ollama#31497