[GH-ISSUE #9990] Inaccurate model size display in ollama ps command #68603

Closed
opened 2026-05-04 14:34:34 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @jaybom on GitHub (Mar 26, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9990

What is the issue?

Image: https://github.com/user-attachments/assets/0c88452a-49c3-45f6-9e97-580c943eea82

Image: https://github.com/user-attachments/assets/2b699cbc-8e13-4fb5-b972-0ffe456f6d97

Is the 'size' shown by the ps command GPU memory usage? If so, why is it inconsistent with nvidia-smi's GPU memory usage display? And why did ollama ps show a size of 40GB in ollama 0.5.7 with the context length set to 4096, but 84GB after upgrading to 0.6.2 with OLLAMA_CONTEXT_LENGTH=4096?
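For reference, the 0.6.2 setup presumably looks like the following on Windows (a sketch only; the model tag and the second-terminal commands are assumptions, the environment variable matches the server log below):

```shell
rem Hypothetical reproduction in cmd; OLLAMA_CONTEXT_LENGTH as in the log below
set OLLAMA_CONTEXT_LENGTH=4096
ollama serve

rem In a second terminal, after the model is loaded (tag is an assumption):
ollama run qwq "hello"
ollama ps
nvidia-smi
```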

C:\Users\Administrator>ollama serve
2025/03/26 11:15:46 routes.go:1230: INFO server config env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:1024 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:24h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:8m0s OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:F:\ollama\.ollama\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:32 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:true ROCR_VISIBLE_DEVICES:]"
time=2025-03-26T11:15:46.091+08:00 level=ERROR source=images.go:422 msg="couldn't remove blob" blob=blobs error="remove F:\ollama\.ollama\models\blobs\blobs: The directory is not empty."
time=2025-03-26T11:15:46.099+08:00 level=INFO source=images.go:432 msg="total blobs: 49"
time=2025-03-26T11:15:46.106+08:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
time=2025-03-26T11:15:46.111+08:00 level=INFO source=routes.go:1297 msg="Listening on 127.0.0.1:11434 (version 0.6.2)"
time=2025-03-26T11:15:46.112+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-03-26T11:15:46.112+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=2
time=2025-03-26T11:15:46.113+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=52 efficiency=0 threads=104
time=2025-03-26T11:15:46.113+08:00 level=INFO source=gpu_windows.go:214 msg="" package=1 cores=52 efficiency=0 threads=104
time=2025-03-26T11:15:46.631+08:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-881b6982-5eba-2cbe-6d7b-9ac090c9a7ee library=cuda compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" overhead="1018.7 MiB"
time=2025-03-26T11:15:47.123+08:00 level=INFO source=gpu.go:319 msg="detected OS VRAM overhead" id=GPU-1e34a307-8fae-e07b-75bc-a69cc18fff6f library=cuda compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" overhead="255.5 MiB"
time=2025-03-26T11:15:47.128+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-881b6982-5eba-2cbe-6d7b-9ac090c9a7ee library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="24.0 GiB" available="22.8 GiB"
time=2025-03-26T11:15:47.129+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-1e34a307-8fae-e07b-75bc-a69cc18fff6f library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="24.0 GiB" available="22.8 GiB"
[GIN] 2025/03/26 - 11:16:41 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/03/26 - 11:17:45 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/03/26 - 11:17:49 | 200 | 58.8154ms | 127.0.0.1 | POST "/api/show"
time=2025-03-26T11:17:50.080+08:00 level=WARN source=ggml.go:149 msg="key not found" key=qwen2.vision.block_count default=0
time=2025-03-26T11:17:50.102+08:00 level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-26T11:17:50.104+08:00 level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-26T11:17:50.105+08:00 level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-26T11:17:50.105+08:00 level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-26T11:17:50.128+08:00 level=INFO source=server.go:105 msg="system memory" total="127.7 GiB" free="86.3 GiB" free_swap="88.3 GiB"
time=2025-03-26T11:17:50.129+08:00 level=WARN source=ggml.go:149 msg="key not found" key=qwen2.vision.block_count default=0
time=2025-03-26T11:17:50.155+08:00 level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-26T11:17:50.155+08:00 level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-26T11:17:50.157+08:00 level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-26T11:17:50.157+08:00 level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-26T11:17:50.161+08:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=65 layers.offload=22 layers.split=11,11 memory.available="[22.8 GiB 22.7 GiB]" memory.gpu_overhead="1.0 KiB" memory.required.full="78.9 GiB" memory.required.partial="45.1 GiB" memory.required.kv="32.0 GiB" memory.required.allocations="[22.5 GiB 22.5 GiB]" memory.weights.total="17.5 GiB" memory.weights.repeating="17.5 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="12.8 GiB" memory.graph.partial="12.8 GiB"
time=2025-03-26T11:17:50.162+08:00 level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-26T11:17:50.162+08:00 level=WARN source=ggml.go:149 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-26T11:17:50.162+08:00 level=INFO source=server.go:185 msg="enabling flash attention"
time=2025-03-26T11:17:50.163+08:00 level=WARN source=server.go:193 msg="kv cache type not supported by model" type=""
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from F:\ollama\.ollama\models\blobs\sha256-c62ccde5630c20c8a9cf601861d31977d07450cad6dfdf1c661aab307107bddb (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = QwQ 32B
llama_model_loader: - kv 3: general.basename str = QwQ
llama_model_loader: - kv 4: general.size_label str = 32B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: general.license.link str = https://huggingface.co/Qwen/QWQ-32B/b...
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen2.5 32B
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-32B
llama_model_loader: - kv 11: general.tags arr[str,2] = ["chat", "text-generation"]
llama_model_loader: - kv 12: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 13: qwen2.block_count u32 = 64
llama_model_loader: - kv 14: qwen2.context_length u32 = 131072
llama_model_loader: - kv 15: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 16: qwen2.feed_forward_length u32 = 27648
llama_model_loader: - kv 17: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 18: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 20: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 30: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 15
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q4_K: 385 tensors
llama_model_loader: - type q6_K: 65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 18.48 GiB (4.85 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen2
print_info: vocab_only = 1
print_info: model type = ?B
print_info: model params = 32.76 B
print_info: general.name = QwQ 32B
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-03-26T11:17:50.835+08:00 level=INFO source=server.go:405 msg="starting llama server" cmd="F:\ollama062\ollama.exe runner --model F:\ollama\.ollama\models\blobs\sha256-c62ccde5630c20c8a9cf601861d31977d07450cad6dfdf1c661aab307107bddb --ctx-size 131072 --batch-size 512 --n-gpu-layers 22 --threads 104 --flash-attn --no-mmap --parallel 32 --tensor-split 11,11 --port 60945"
time=2025-03-26T11:17:50.844+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-26T11:17:50.846+08:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-03-26T11:17:50.848+08:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-03-26T11:17:50.918+08:00 level=INFO source=runner.go:846 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from F:\ollama062\lib\ollama\cuda_v12\ggml-cuda.dll
load_backend: loaded CPU backend from F:\ollama062\lib\ollama\ggml-cpu-skylakex.dll
time=2025-03-26T11:17:51.906+08:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-03-26T11:17:51.908+08:00 level=INFO source=runner.go:906 msg="Server listening on 127.0.0.1:60945"
time=2025-03-26T11:17:52.117+08:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23306 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23306 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from F:\ollama\.ollama\models\blobs\sha256-c62ccde5630c20c8a9cf601861d31977d07450cad6dfdf1c661aab307107bddb (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = QwQ 32B
llama_model_loader: - kv 3: general.basename str = QwQ
llama_model_loader: - kv 4: general.size_label str = 32B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: general.license.link str = https://huggingface.co/Qwen/QWQ-32B/b...
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen2.5 32B
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-32B
llama_model_loader: - kv 11: general.tags arr[str,2] = ["chat", "text-generation"]
llama_model_loader: - kv 12: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 13: qwen2.block_count u32 = 64
llama_model_loader: - kv 14: qwen2.context_length u32 = 131072
llama_model_loader: - kv 15: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 16: qwen2.feed_forward_length u32 = 27648
llama_model_loader: - kv 17: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 18: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 20: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 30: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 15
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q4_K: 385 tensors
llama_model_loader: - type q6_K: 65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 18.48 GiB (4.85 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 5120
print_info: n_layer = 64
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 32B
print_info: model params = 32.76 B
print_info: general.name = QwQ 32B
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 22 repeating layers to GPU
load_tensors: offloaded 22/65 layers to GPU
load_tensors: CUDA_Host model buffer size = 12283.30 MiB
load_tensors: CUDA0 model buffer size = 3022.29 MiB
load_tensors: CUDA1 model buffer size = 3202.76 MiB
load_tensors: CPU model buffer size = 417.66 MiB
[GIN] 2025/03/26 - 11:18:03 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/03/26 - 11:18:03 | 200 | 18.6µs | 127.0.0.1 | GET "/api/ps"
llama_init_from_model: n_seq_max = 32
llama_init_from_model: n_ctx = 131072
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch = 16384
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 1
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 131072, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1
llama_kv_cache_init: CUDA0 KV buffer size = 5632.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 5632.00 MiB
llama_kv_cache_init: CPU KV buffer size = 21504.00 MiB
llama_init_from_model: KV self size = 32768.00 MiB, K (f16): 16384.00 MiB, V (f16): 16384.00 MiB
llama_init_from_model: CPU output buffer size = 19.19 MiB
llama_init_from_model: CUDA0 compute buffer size = 926.02 MiB
llama_init_from_model: CUDA1 compute buffer size = 256.00 MiB
llama_init_from_model: CUDA_Host compute buffer size = 266.01 MiB
llama_init_from_model: graph nodes = 1991
llama_init_from_model: graph splits = 593 (with bs=512), 4 (with bs=1)
time=2025-03-26T11:18:12.457+08:00 level=INFO source=server.go:619 msg="llama runner started in 21.61 seconds"
[GIN] 2025/03/26 - 11:18:12 | 200 | 22.5176871s | 127.0.0.1 | POST "/api/generate"
[GIN] 2025/03/26 - 11:20:13 | 200 | 113.7µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/03/26 - 11:20:18 | 200 | 0s | 127.0.0.1 | GET "/api/ps"
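
As a sanity check on the figures above, the 32768 MiB KV self size follows directly from the logged parameters (kv_size = 131072, n_embd_k_gqa = n_embd_v_gqa = 1024, n_layer = 64, f16 = 2 bytes per element); a minimal sketch, assuming a POSIX shell for the arithmetic:

```shell
# K cache in MiB: kv_size * n_embd_k_gqa * 2 bytes (f16) * n_layer
echo $(( 131072 * 1024 * 2 * 64 / 1024 / 1024 ))   # 16384 MiB; the V cache is the same size
# K + V = 32768 MiB, matching "KV self size = 32768.00 MiB" in the log
```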

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-05-04 14:34:34 -05:00
Author
Owner

@rick-github commented on GitHub (Mar 26, 2025):

If so, why is it inconsistent with nvidia-smi's GPU memory usage display?

The ollama server estimates memory usage, but it's the GPU backend that actually allocates it. If the GPU backend is more efficient than the ollama server expects (e.g., use of flash attention or a sliding window), then the estimate (the output of ollama ps) will be wrong. See #6160, #9987. You can improve VRAM utilization by overriding num_gpu (https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650) and increasing the offloaded layer count.
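
A minimal sketch of one way to do that, overriding num_gpu per request through the REST API (written for a POSIX shell; the model tag and the layer count are illustrative assumptions, the log above shows 22 of 65 layers offloaded by default):

```shell
# num_gpu can also be set interactively with "/set parameter num_gpu <n>" inside "ollama run"
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "qwq",
  "prompt": "hello",
  "options": { "num_gpu": 30 }
}'
```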

Why did ollama ps show a size of 40GB in ollama 0.5.7 with the context length set to 4096, but 84GB after upgrading to 0.6.2 with OLLAMA_CONTEXT_LENGTH=4096?

How did you set the context length in 0.5.7? OLLAMA_CONTEXT_LENGTH is not supported in 0.5.7 (it was introduced in 0.5.13), so ollama will use the default of 2048.
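
For comparison, in 0.5.x the context length is typically set per model or per request with the num_ctx parameter rather than an environment variable; a hedged sketch (the model tag is an assumption):

```shell
# Inside an interactive session (also works on 0.5.x):
ollama run qwq
>>> /set parameter num_ctx 4096
```

The same option can be sent as "options": {"num_ctx": 4096} in an /api/generate or /api/chat request.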

Author
Owner

@tim-roethig commented on GitHub (Mar 27, 2025):

But the VRAM usage shown in ollama ps is still what decides whether an older model is unloaded from the GPU, isn't it?
So setting num_gpu would only help in the case where layers are offloaded to the CPU while there is still memory left, not in the case where models are wrongfully unloaded from the GPU, right?

Author
Owner

@tim-roethig commented on GitHub (Mar 27, 2025):

Will this commit fix the behavior? https://github.com/ollama/ollama/commit/f66216e3990b73869341c58ac9561b26c468c558

Author
Owner

@rick-github commented on GitHub (Mar 27, 2025):

But the VRAM usage shown in ollama ps is still what decides whether an older model is unloaded from the GPU, isn't it?

Yes, the estimated VRAM of current runners is used when determining how much VRAM is available.
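
A quick way to see the gap between that estimate and what is actually allocated is to compare the two views side by side; a minimal sketch:

```shell
ollama ps                                                            # estimated size per loaded model, as computed by the scheduler
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv   # VRAM actually reported by the driver
```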

Will this commit fix the behavior?

It helps, but only for models that have a sliding window (gemma2/3, llama3.2-vision)

Reference: github-starred/ollama#68603