[GH-ISSUE #11104] Memory leak with Kimi-Dev-72B? #69384

Closed
opened 2026-05-04 17:57:18 -05:00 by GiteaMirror · 15 comments
Owner

Originally created by @blakkd on GitHub (Jun 17, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11104

What is the issue?

When trying to load a Q3_K_XL quant from both Bartowski (a regular quant) and Unsloth (the dynamic one mixing K-quant and IQ-quant layers), ollama doesn't load onto the GPU even when setting a small num_gpu value like 5. It goes straight to RAM instead, leading to an OOM.
On top of that, when exiting Ollama (CTRL-C), only a portion of the RAM gets freed, even when I stop before the OOM happens (e.g. with only ~5 GB of the model loaded).
The system can't reclaim this RAM afterwards, so it looks like a leak.
The only fix I found was rebooting.

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.9.1

Screenshot

Image

GiteaMirror added the bug label 2026-05-04 17:57:18 -05:00
Author
Owner

@rick-github commented on GitHub (Jun 17, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging, but the likely cause of no layers on the GPU is that the minimum free VRAM that ollama wants before allocating layers is not available on the device. Forcing layers on with num_gpu then exceeds ollama's estimate and the runner OOMs. Since you are running Linux with Nvidia, you can try setting GGML_CUDA_ENABLE_UNIFIED_MEMORY=1, as discussed in https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288.
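For reference, a minimal way to try that variable (assuming ollama is started manually from a shell; systemd installs would instead add the variable via systemctl edit ollama.service):

```shell
# Allow CUDA allocations to spill into system RAM via unified memory
# instead of failing outright when VRAM runs short.
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
ollama serve
```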

ollama is a client-server architecture, and the server keeps running when you exit the ollama client. The server will keep a model loaded for 5 minutes after it goes idle; this can be adjusted with keep_alive (https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-keep-a-model-loaded-in-memory-or-make-it-unload-immediately).
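A sketch of that knob in practice, assuming the default local endpoint and using this issue's model name as a placeholder:

```shell
# keep_alive: 0 asks the server to unload the model immediately
# after this request, instead of holding it for the default 5 minutes.
curl http://localhost:11434/api/generate -d '{
  "model": "Kimi-Dev-72B:Q3_K_XL_num_gpu30-num_ctx256",
  "keep_alive": 0
}'
```

Alternatively, ollama stop <model> unloads a running model from the CLI, and ollama ps shows what is currently loaded.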

Author
Owner

@blakkd commented on GitHub (Jun 17, 2025):

Really sorry @rick-github, my description of the issue was actually inaccurate!
I had to reboot multiple times because of this, hence the delay; I should have deleted the report and rewritten it instead of trying to amend it afterwards...

What I said about the VRAM not being used was wrong, but it still seems something is off with the memory release.

Here is my screen recording; I'll also attach the logs of ollama serve along with ollama run --verbose below it.

https://github.com/user-attachments/assets/0d767af0-f593-409f-9dd0-64c6dd298413

Rebooting lol, I'll paste the logs after that.

Author
Owner

@blakkd commented on GitHub (Jun 17, 2025):

OK, so here is the ollama serve log:

~ ❯❯❯ ollama serve
time=2025-06-18T01:16:05.234+02:00 level=INFO source=routes.go:1235 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/user/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-06-18T01:16:05.241+02:00 level=INFO source=images.go:476 msg="total blobs: 126"
time=2025-06-18T01:16:05.243+02:00 level=INFO source=images.go:483 msg="total unused blobs removed: 0"
time=2025-06-18T01:16:05.244+02:00 level=INFO source=routes.go:1288 msg="Listening on [::]:11434 (version 0.9.1)"
time=2025-06-18T01:16:05.244+02:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-06-18T01:16:05.548+02:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-9465e539-604b-36ff-5748-e5f9d18409ea library=cuda variant=v12 compute=8.6 driver=12.9 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.2 GiB"
[GIN] 2025/06/18 - 01:16:07 | 200 |      47.209µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/06/18 - 01:16:07 | 200 |   50.281327ms |       127.0.0.1 | POST     "/api/show"
time=2025-06-18T01:16:08.343+02:00 level=INFO source=sched.go:788 msg="new model will fit in available VRAM in single GPU, loading" model=/home/user/.ollama/models/blobs/sha256-ac2e0213a80eb6312804e7ff2f21e573fbcfba6e4be71935c25e35b13d126c09 gpu=GPU-9465e539-604b-36ff-5748-e5f9d18409ea parallel=1 available=24867307520 required="15.6 GiB"
time=2025-06-18T01:16:08.505+02:00 level=INFO source=server.go:135 msg="system memory" total="31.2 GiB" free="29.0 GiB" free_swap="4.0 GiB"
time=2025-06-18T01:16:08.666+02:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=30 layers.model=81 layers.offload=30 layers.split="" memory.available="[23.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="39.3 GiB" memory.required.partial="15.6 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[15.6 GiB]" memory.weights.total="36.6 GiB" memory.weights.repeating="35.3 GiB" memory.weights.nonrepeating="1.2 GiB" memory.graph.full="313.0 MiB" memory.graph.partial="1.3 GiB"
time=2025-06-18T01:16:08.666+02:00 level=INFO source=server.go:211 msg="enabling flash attention"
time=2025-06-18T01:16:08.666+02:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
llama_model_loader: loaded meta data with 35 key-value pairs and 963 tensors from /home/user/.ollama/models/blobs/sha256-ac2e0213a80eb6312804e7ff2f21e573fbcfba6e4be71935c25e35b13d126c09 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Kimi Dev 72B
llama_model_loader: - kv   3:                           general.basename str              = Kimi-Dev
llama_model_loader: - kv   4:                         general.size_label str              = 72B
llama_model_loader: - kv   5:                            general.license str              = mit
llama_model_loader: - kv   6:                   general.base_model.count u32              = 1
llama_model_loader: - kv   7:                  general.base_model.0.name str              = Qwen2.5 72B
llama_model_loader: - kv   8:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv   9:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-72B
llama_model_loader: - kv  10:                               general.tags arr[str,4]       = ["code", "swebench", "software", "iss...
llama_model_loader: - kv  11:                          qwen2.block_count u32              = 80
llama_model_loader: - kv  12:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv  13:                     qwen2.embedding_length u32              = 8192
llama_model_loader: - kv  14:                  qwen2.feed_forward_length u32              = 29568
llama_model_loader: - kv  15:                 qwen2.attention.head_count u32              = 64
llama_model_loader: - kv  16:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  17:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  18:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                          general.file_type u32              = 13
llama_model_loader: - kv  31:                      quantize.imatrix.file str              = /models_out/Kimi-Dev-72B-GGUF/moonsho...
llama_model_loader: - kv  32:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  33:             quantize.imatrix.entries_count u32              = 560
llama_model_loader: - kv  34:              quantize.imatrix.chunks_count u32              = 498
llama_model_loader: - type  f32:  401 tensors
llama_model_loader: - type q5_1:   80 tensors
llama_model_loader: - type q8_0:    2 tensors
llama_model_loader: - type q3_K:  320 tensors
llama_model_loader: - type q5_K:  160 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q3_K - Large
print_info: file size   = 37.80 GiB (4.47 BPW) 
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 72.71 B
print_info: general.name     = Kimi Dev 72B
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-06-18T01:16:08.877+02:00 level=INFO source=server.go:431 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /home/user/.ollama/models/blobs/sha256-ac2e0213a80eb6312804e7ff2f21e573fbcfba6e4be71935c25e35b13d126c09 --ctx-size 256 --batch-size 512 --n-gpu-layers 30 --threads 16 --flash-attn --no-mmap --parallel 1 --port 33519"
time=2025-06-18T01:16:08.877+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-06-18T01:16:08.877+02:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-06-18T01:16:08.878+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-06-18T01:16:08.893+02:00 level=INFO source=runner.go:815 msg="starting go runner"
load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
time=2025-06-18T01:16:09.097+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-06-18T01:16:09.098+02:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:33519"
time=2025-06-18T01:16:09.128+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23715 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 963 tensors from /home/user/.ollama/models/blobs/sha256-ac2e0213a80eb6312804e7ff2f21e573fbcfba6e4be71935c25e35b13d126c09 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Kimi Dev 72B
llama_model_loader: - kv   3:                           general.basename str              = Kimi-Dev
llama_model_loader: - kv   4:                         general.size_label str              = 72B
llama_model_loader: - kv   5:                            general.license str              = mit
llama_model_loader: - kv   6:                   general.base_model.count u32              = 1
llama_model_loader: - kv   7:                  general.base_model.0.name str              = Qwen2.5 72B
llama_model_loader: - kv   8:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv   9:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-72B
llama_model_loader: - kv  10:                               general.tags arr[str,4]       = ["code", "swebench", "software", "iss...
llama_model_loader: - kv  11:                          qwen2.block_count u32              = 80
llama_model_loader: - kv  12:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv  13:                     qwen2.embedding_length u32              = 8192
llama_model_loader: - kv  14:                  qwen2.feed_forward_length u32              = 29568
llama_model_loader: - kv  15:                 qwen2.attention.head_count u32              = 64
llama_model_loader: - kv  16:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  17:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  18:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                          general.file_type u32              = 13
llama_model_loader: - kv  31:                      quantize.imatrix.file str              = /models_out/Kimi-Dev-72B-GGUF/moonsho...
llama_model_loader: - kv  32:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  33:             quantize.imatrix.entries_count u32              = 560
llama_model_loader: - kv  34:              quantize.imatrix.chunks_count u32              = 498
llama_model_loader: - type  f32:  401 tensors
llama_model_loader: - type q5_1:   80 tensors
llama_model_loader: - type q8_0:    2 tensors
llama_model_loader: - type q3_K:  320 tensors
llama_model_loader: - type q5_K:  160 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q3_K - Large
print_info: file size   = 37.80 GiB (4.47 BPW) 
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 8192
print_info: n_layer          = 80
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 29568
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 70B
print_info: model params     = 72.71 B
print_info: general.name     = Kimi Dev 72B
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
time=2025-06-18T01:16:20.617+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-06-18T01:16:25.828+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_host_malloc: failed to allocate 25139.77 MiB of pinned memory: invalid argument
load_tensors: offloading 30 repeating layers to GPU
load_tensors: offloaded 30/81 layers to GPU
load_tensors:          CPU model buffer size = 25139.77 MiB
load_tensors:        CUDA0 model buffer size = 13569.14 MiB
^Ctime=2025-06-18T01:16:43.393+02:00 level=WARN source=server.go:598 msg="client connection closed before server finished loading, aborting load"
time=2025-06-18T01:16:43.401+02:00 level=ERROR source=sched.go:489 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"
[GIN] 2025/06/18 - 01:16:43 | 499 |     35.47885s |       127.0.0.1 | POST     "/api/generate"
^C⏎                                                                   

The Modelfile from which I created Kimi-Dev-72B:Q3_K_XL_num_gpu30-num_ctx256, using:

ollama create -f Modelfile.Kimi-Dev-72B:Q3_K_XL Kimi-Dev-72B:Q3_K_XL_num_gpu30-num_ctx256

FROM hf.co/bartowski/moonshotai_Kimi-Dev-72B-GGUF:Q3_K_XL

TEMPLATE """{{- if .Suffix }}<|fim_prefix|>{{ .Prompt }}<|fim_suffix|>{{ .Suffix }}<|fim_middle|>
{{- else if .Messages }}
{{- if or .System .Tools }}<|im_start|>system
{{- if .System }}
{{ .System }}
{{- end }}
{{- if .Tools }}

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}
{{- end }}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end }}<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ if .Content }}{{ .Content }}
{{- else if .ToolCalls }}<tool_call>
{{ range .ToolCalls }}{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{ end }}</tool_call>
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
{{ end }}
{{- end }}
{{- else }}
{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ end }}{{ .Response }}{{ if .Response }}<|im_end|>{{ end }}"""

PARAMETER num_ctx 256
PARAMETER num_gpu 30
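(As an aside, the same parameters can also be passed per-request through the API's options field, without baking a dedicated model; a sketch assuming the default endpoint:

```shell
# Per-request overrides of num_gpu and num_ctx, equivalent to the
# PARAMETER lines above but without creating a new model.
curl http://localhost:11434/api/generate -d '{
  "model": "hf.co/bartowski/moonshotai_Kimi-Dev-72B-GGUF:Q3_K_XL",
  "prompt": "hello",
  "options": {"num_gpu": 30, "num_ctx": 256}
}'
```
)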

And again, htop after exiting ollama with CTRL-C (note there is no ollama process anymore):


    0[| 1.2%]   4[| 0.6%]  8[||1.2%] 12[  0.0%]  16[| 0.6%]  20[  0.0%] 24[  0.0%] 28[  0.0%]
    1[| 0.6%]   5[  0.0%]  9[| 0.6%] 13[  0.0%]  17[| 0.6%]  21[  0.0%] 25[  0.0%] 29[  0.0%]
    2[||1.9%]   6[  0.0%] 10[||1.2%] 14[  0.0%]  18[||7.4%]  22[  0.0%] 26[| 0.6%] 30[| 0.6%]
    3[  0.0%]   7[| 0.6%] 11[  0.0%] 15[| 0.6%]  19[  0.0%]  23[  0.0%] 27[  0.0%] 31[  0.0%]
  Mem[|||||||||||||||||||||||||||||28.3G/31.2G] Tasks: 162, 1259 thr, 371 kthr; 1 running
  Swp[|||||                         321M/4.00G] Load average: 0.23 0.38 0.27 
                                                Uptime: 00:13:42


The ollama run Kimi-Dev-72B:Q3_K_XL_num_gpu30-num_ctx256 --verbose command returned nothing, so I gather it only prints its stats once inference is done.
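A side note on the unreclaimed RAM: a quick way to tell actually leaked memory apart from kernel page cache (which htop's memory bar also displays) is to compare free's figures; a diagnostic sketch for a typical Linux box:

```shell
# 'available' includes reclaimable page cache; 'free' does not,
# so a low 'free' with a high 'available' is cache, not a leak.
free -h
grep -E 'MemAvailable|^Cached' /proc/meminfo
# If the memory is cache, it can be dropped explicitly (root required):
# sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
```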

llama_model_loader: - kv 11: qwen2.block_count u32 = 80 llama_model_loader: - kv 12: qwen2.context_length u32 = 131072 llama_model_loader: - kv 13: qwen2.embedding_length u32 = 8192 llama_model_loader: - kv 14: qwen2.feed_forward_length u32 = 29568 llama_model_loader: - kv 15: qwen2.attention.head_count u32 = 64 llama_model_loader: - kv 16: qwen2.attention.head_count_kv u32 = 8 llama_model_loader: - kv 17: qwen2.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 18: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 19: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 20: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 22: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 23: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 24: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 25: tokenizer.ggml.padding_token_id u32 = 151643 llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 151643 llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 28: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... llama_model_loader: - kv 29: general.quantization_version u32 = 2 llama_model_loader: - kv 30: general.file_type u32 = 13 llama_model_loader: - kv 31: quantize.imatrix.file str = /models_out/Kimi-Dev-72B-GGUF/moonsho... 
llama_model_loader: - kv 32: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt llama_model_loader: - kv 33: quantize.imatrix.entries_count u32 = 560 llama_model_loader: - kv 34: quantize.imatrix.chunks_count u32 = 498 llama_model_loader: - type f32: 401 tensors llama_model_loader: - type q5_1: 80 tensors llama_model_loader: - type q8_0: 2 tensors llama_model_loader: - type q3_K: 320 tensors llama_model_loader: - type q5_K: 160 tensors print_info: file format = GGUF V3 (latest) print_info: file type = Q3_K - Large print_info: file size = 37.80 GiB (4.47 BPW) load: special tokens cache size = 22 load: token to piece cache size = 0.9310 MB print_info: arch = qwen2 print_info: vocab_only = 0 print_info: n_ctx_train = 131072 print_info: n_embd = 8192 print_info: n_layer = 80 print_info: n_head = 64 print_info: n_head_kv = 8 print_info: n_rot = 128 print_info: n_swa = 0 print_info: n_swa_pattern = 1 print_info: n_embd_head_k = 128 print_info: n_embd_head_v = 128 print_info: n_gqa = 8 print_info: n_embd_k_gqa = 1024 print_info: n_embd_v_gqa = 1024 print_info: f_norm_eps = 0.0e+00 print_info: f_norm_rms_eps = 1.0e-06 print_info: f_clamp_kqv = 0.0e+00 print_info: f_max_alibi_bias = 0.0e+00 print_info: f_logit_scale = 0.0e+00 print_info: f_attn_scale = 0.0e+00 print_info: n_ff = 29568 print_info: n_expert = 0 print_info: n_expert_used = 0 print_info: causal attn = 1 print_info: pooling type = -1 print_info: rope type = 2 print_info: rope scaling = linear print_info: freq_base_train = 1000000.0 print_info: freq_scale_train = 1 print_info: n_ctx_orig_yarn = 131072 print_info: rope_finetuned = unknown print_info: ssm_d_conv = 0 print_info: ssm_d_inner = 0 print_info: ssm_d_state = 0 print_info: ssm_dt_rank = 0 print_info: ssm_dt_b_c_rms = 0 print_info: model type = 70B print_info: model params = 72.71 B print_info: general.name = Kimi Dev 72B print_info: vocab type = BPE print_info: n_vocab = 152064 print_info: n_merges = 151387 print_info: BOS token = 151643 
'<|endoftext|>' print_info: EOS token = 151645 '<|im_end|>' print_info: EOT token = 151645 '<|im_end|>' print_info: PAD token = 151643 '<|endoftext|>' print_info: LF token = 198 'Ċ' print_info: FIM PRE token = 151659 '<|fim_prefix|>' print_info: FIM SUF token = 151661 '<|fim_suffix|>' print_info: FIM MID token = 151660 '<|fim_middle|>' print_info: FIM PAD token = 151662 '<|fim_pad|>' print_info: FIM REP token = 151663 '<|repo_name|>' print_info: FIM SEP token = 151664 '<|file_sep|>' print_info: EOG token = 151643 '<|endoftext|>' print_info: EOG token = 151645 '<|im_end|>' print_info: EOG token = 151662 '<|fim_pad|>' print_info: EOG token = 151663 '<|repo_name|>' print_info: EOG token = 151664 '<|file_sep|>' print_info: max token length = 256 load_tensors: loading model tensors, this can take a while... (mmap = false) time=2025-06-18T01:16:20.617+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding" time=2025-06-18T01:16:25.828+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model" ggml_cuda_host_malloc: failed to allocate 25139.77 MiB of pinned memory: invalid argument load_tensors: offloading 30 repeating layers to GPU load_tensors: offloaded 30/81 layers to GPU load_tensors: CPU model buffer size = 25139.77 MiB load_tensors: CUDA0 model buffer size = 13569.14 MiB ^Ctime=2025-06-18T01:16:43.393+02:00 level=WARN source=server.go:598 msg="client connection closed before server finished loading, aborting load" time=2025-06-18T01:16:43.401+02:00 level=ERROR source=sched.go:489 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled" [GIN] 2025/06/18 - 01:16:43 | 499 | 35.47885s | 127.0.0.1 | POST "/api/generate" ^C⏎ ``` The modelfile from which I created `Kimi-Dev-72B:Q3_K_XL_num_gpu30-num_ctx256` using: `ollama create -f Modelfile.Kimi-Dev-72B:Q3_K_XL Kimi-Dev-72B:Q3_K_XL_num_gpu30-num_ctx256` ``` 
FROM hf.co/bartowski/moonshotai_Kimi-Dev-72B-GGUF:Q3_K_XL TEMPLATE """{{- if .Suffix }}<|fim_prefix|>{{ .Prompt }}<|fim_suffix|>{{ .Suffix }}<|fim_middle|> {{- else if .Messages }} {{- if or .System .Tools }}<|im_start|>system {{- if .System }} {{ .System }} {{- end }} {{- if .Tools }} # Tools You may call one or more functions to assist with the user query. You are provided with function signatures within <tools></tools> XML tags: <tools> {{- range .Tools }} {"type": "function", "function": {{ .Function }}} {{- end }} </tools> For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags: <tool_call> {"name": <function-name>, "arguments": <args-json-object>} </tool_call> {{- end }}<|im_end|> {{ end }} {{- range $i, $_ := .Messages }} {{- $last := eq (len (slice $.Messages $i)) 1 -}} {{- if eq .Role "user" }}<|im_start|>user {{ .Content }}<|im_end|> {{ else if eq .Role "assistant" }}<|im_start|>assistant {{ if .Content }}{{ .Content }} {{- else if .ToolCalls }}<tool_call> {{ range .ToolCalls }}{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}} {{ end }}</tool_call> {{- end }}{{ if not $last }}<|im_end|> {{ end }} {{- else if eq .Role "tool" }}<|im_start|>user <tool_response> {{ .Content }} </tool_response><|im_end|> {{ end }} {{- if and (ne .Role "assistant") $last }}<|im_start|>assistant {{ end }} {{- end }} {{- else }} {{- if .System }}<|im_start|>system {{ .System }}<|im_end|> {{ end }}{{ if .Prompt }}<|im_start|>user {{ .Prompt }}<|im_end|> {{ end }}<|im_start|>assistant {{ end }}{{ .Response }}{{ if .Response }}<|im_end|>{{ end }}""" PARAMETER num_ctx 256 PARAMETER num_gpu 30 ``` And again: `htop` after the ollama CTRL-C exit (we see no ollama process anymore): ``` 0[| 1.2%] 4[| 0.6%] 8[||1.2%] 12[ 0.0%] 16[| 0.6%] 20[ 0.0%] 24[ 0.0%] 28[ 0.0%] 1[| 0.6%] 5[ 0.0%] 9[| 0.6%] 13[ 0.0%] 17[| 0.6%] 21[ 0.0%] 25[ 0.0%] 29[ 0.0%] 2[||1.9%] 6[ 0.0%] 10[||1.2%] 14[ 0.0%] 18[||7.4%] 22[ 0.0%] 
26[| 0.6%] 30[| 0.6%] 3[ 0.0%] 7[| 0.6%] 11[ 0.0%] 15[| 0.6%] 19[ 0.0%] 23[ 0.0%] 27[ 0.0%] 31[ 0.0%] Mem[|||||||||||||||||||||||||||||28.3G/31.2G] Tasks: 162, 1259 thr, 371 kthr; 1 running Swp[||||| 321M/4.00G] Load average: 0.23 0.38 0.27 Uptime: 00:13:42 ``` The `ollama run Kimi-Dev-72B:Q3_K_XL_num_gpu30-num_ctx256 --vebose` returned nothing so I understood it only prints out the logs when inference is done.
Author
Owner

@blakkd commented on GitHub (Jun 17, 2025):

The same occurs with:

  • bartowski/moonshotai_Kimi-Dev-72B-GGUF:Q3_K_XL
  • unsloth/Kimi-Dev-72B-GGUF:Q3_K_XL
  • unsloth/Kimi-Dev-72B-GGUF:Q4_0
Author
Owner

@blakkd commented on GitHub (Jun 17, 2025):

And I know Unsloth and Bartowski both had quantization issues with some quants, so maybe that's related?
Either way, I think ollama should be able to handle such broken quants, if that's indeed what they are.
https://huggingface.co/bartowski/moonshotai_Kimi-Dev-72B-GGUF

Had issues with some Q4_K quants getting NaN, making rest public for now, will fix others when I have a chance and upload them

https://huggingface.co/unsloth/Kimi-Dev-72B-GGUF/discussions/1

Thanks there mustve been something wrong with the conversion process

But still, I don't know if it can be replicated by anyone else.

Author
Owner

@rick-github commented on GitHub (Jun 18, 2025):

The OS using all RAM is normal; it's called the [page cache](https://en.wikipedia.org/wiki/Page_cache). It's not an ollama memory leak.
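
A quick way to tell page cache apart from genuinely allocated memory is `MemAvailable`, which already discounts reclaimable cache. A minimal check (plain `/proc/meminfo` reading, nothing ollama-specific):

```shell
# MemAvailable already discounts reclaimable page cache, so a healthy
# MemAvailable means most of the "used" RAM is cache, not a leak.
awk '/^(MemTotal|MemAvailable|Cached):/ {printf "%s %d MiB\n", $1, $2/1024}' /proc/meminfo
```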

Author
Owner

@blakkd commented on GitHub (Jun 18, 2025):

No but shouldn't it fit? I have ~50GB of [RAM+VRAM]. The models are ~40GB and I load with a context window of 256.

Author
Owner

@rick-github commented on GitHub (Jun 18, 2025):

The log shows it was loading until the client was terminated.

Author
Owner

@blakkd commented on GitHub (Jun 18, 2025):

I'm pretty sure I loaded a 70B before; I'm gonna test with a lower quant right now so we can close this issue if it's my fault.

Author
Owner

@blakkd commented on GitHub (Jun 18, 2025):

OK, so finally! A lower quant loads entirely, so this was my fault. But I still see only part of the RAM released after the ollama server exits:

    0[0.6]  4[0.0]  8[1.2] 12[0.6]  16[0.0]  20[0.0] 24[0.0] 28[0.0]
    1[0.0]  5[0.0]  9[0.0] 13[0.0]  17[8.6]  21[0.0] 25[0.0] 29[0.0]
    2[4.9]  6[0.0] 10[0.0] 14[0.0]  18[0.6]  22[0.0] 26[0.6] 30[0.0]
    3[0.0]  7[0.6] 11[0.0] 15[0.0]  19[0.0]  23[0.6] 27[0.0] 31[0.0]
  Mem[||||||||||||||||16.5G/31.2G] Tasks: 161, 1300 thr, 378 kthr; 1
  Swp[||||||||||||||||2.72G/4.00G] Load average: 1.26 2.04 1.13 
                                   Uptime: 00:24:32
Author
Owner

@rick-github commented on GitHub (Jun 18, 2025):

Run htop and watch the memory usage while running the following:

echo 3 | sudo tee /proc/sys/vm/drop_caches
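
To quantify the effect, here is a small sketch (assumes Linux `/proc`; the drop itself needs root, and without root it only reports the current values) that records `MemFree` before and after:

```shell
#!/bin/sh
# Sketch: measure how much RAM dropping the caches actually reclaims.
before=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
if [ "$(id -u)" -eq 0 ]; then
    sync
    echo 3 > /proc/sys/vm/drop_caches
fi
after=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
echo "MemFree before: ${before} kB, after: ${after} kB"
```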
Author
Owner

@blakkd commented on GitHub (Jun 19, 2025):

Good guess!
But unfortunately I already tried `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'` as well as the alternative you suggested, and neither does the trick: the used RAM still isn't released :/

Author
Owner

@rick-github commented on GitHub (Jun 19, 2025):

ps axho %mem,cmd | sort -n
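
For reference, the same information can be pulled straight from `/proc` if `ps` output is unwieldy. A minimal sketch (not ollama-specific) that ranks processes by `VmRSS`, to check whether any process still holds the model's RAM:

```shell
# Sketch: list the top resident-memory holders straight from /proc.
# Kernel threads have no VmRSS line and are silently skipped.
for d in /proc/[0-9]*; do
    awk -v pid="${d#/proc/}" '/^VmRSS:/ {printf "%8d kB  pid %s\n", $2, pid}' \
        "$d/status" 2>/dev/null
done | sort -rn | head
```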
Author
Owner

@blakkd commented on GitHub (Jun 20, 2025):

I tried getting help from Qwen3 32B, AM-Thinking, and finally ended up blindly following Gemini 2.5 Pro's instructions, and: no clue, they all end up concluding the only solution is to reboot.

Here are all the commands I tried, along with my prompts and the responses. I formatted it as best I could to make it less of a pain to read, since I'm aware it's a huge message... This was from my last attempt, with Gemini 2.5 Pro.

All the commands were run after a CTRL-C (SIGINT) on ollama while the model was successfully loaded with 48/81 layers in VRAM and 33/81 in RAM (I later tried 50/81 VRAM - 31/81 RAM, and finally 53/81 VRAM - 28/81 RAM, which led to a `cudaMalloc failed: out of memory`; same result: VRAM released, but RAM only partially).

One more detail: loading Qwen3-32B the same way (split across RAM/VRAM, or entirely in RAM) and then SIGINT-ing ollama doesn't trigger this RAM-not-released behavior that occurs with Kimi-Dev. I've actually never had this issue with any model other than this specific one.

~ ❯❯❯ sudo sync && sudo echo 3 | sudo tee /proc/sys/vm/drop_caches

[sudo] password for user:           

3


    0[|0.6%]  4[ 0.0%]  8[|3.6%] 12[|0.6%]  16[ 0.0%]  20[ 0.0%] 24[ 0.0%] 28[ 0.0%]

    1[|4.9%]  5[ 0.0%]  9[|0.6%] 13[ 0.0%]  17[|6.1%]  21[ 0.0%] 25[ 0.0%] 29[|0.6%]

    2[|9.8%]  6[ 0.0%] 10[|0.6%] 14[ 0.0%]  18[ 0.0%]  22[ 0.0%] 26[ 0.0%] 30[|0.6%]

    3[|1.2%]  7[|0.6%] 11[|0.6%] 15[ 0.0%]  19[ 0.0%]  23[ 0.0%] 27[|0.6%] 31[|1.2%]

  Mem[||||||||||||||||||||||||17.2G/31.2G] Tasks: 174, 1484 thr, 373 kthr; 1 running

  Swp[||||||||||||||||||||||||2.65G/4.00G] Load average: 0.37 0.53 0.63 

                                           Uptime: 01:54:32



  [Main] [I/O]

    PID USER       PRI  NI  VIRT   RES   SHR S  CPU% MEM%▽  TIME+  Command

   7355 user        20   0 1392G  329M  111M S   0.0  1.0  1:56.50 /opt/brave.com/brav

   7357 user        20   0 1392G  329M  111M S   0.0  1.0  0:00.01 /opt/brave.com/brav

   7358 user        20   0 1392G  329M  111M S   0.0  1.0  0:48.45 /opt/brave.com/brav

   7359 user        20   0 1392G  329M  111M S   0.0  1.0  0:06.39 /opt/brave.com/brav

   7360 user        20   0 1392G  329M  111M S   0.0  1.0  0:10.91 /opt/brave.com/brav

   7361 user        20   0 1392G  329M  111M S   0.0  1.0  0:31.34 /opt/brave.com/brav

   7362 user        20   0 1392G  329M  111M S   0.0  1.0  0:00.00 /opt/brave.com/brav

   7363 user        20   0 1392G  329M  111M S   0.0  1.0  0:00.00 /opt/brave.com/brav

   7380 user        20   0 1392G  329M  111M S   0.0  1.0  0:05.55 /opt/brave.com/brav

   7381 user        20   0 1392G  329M  111M S   0.0  1.0  0:05.29 /opt/brave.com/brav

   7382 user        20   0 1392G  329M  111M S   0.0  1.0  0:05.18 /opt/brave.com/brav

   7478 user        20   0 1392G  329M  111M S   0.0  1.0  0:05.76 /opt/brave.com/brav

   7479 user        20   0 1392G  329M  111M S   0.0  1.0  0:03.98 /opt/brave.com/brav

   7480 user        20   0 1392G  329M  111M S   0.0  1.0  0:00.56 /opt/brave.com/brav

   7481 user        20   0 1392G  329M  111M S   0.0  1.0  0:02.90 /opt/brave.com/brav

   7482 user        20   0 1392G  329M  111M S   0.0  1.0  0:01.33 /opt/brave.com/brav

   7483 user        20   0 1392G  329M  111M S   0.0  1.0  0:00.71 /opt/brave.com/brav

   7561 user        20   0 1392G  329M  111M S   0.0  1.0  0:00.49 /opt/brave.com/brav


~ ❯❯❯ free -h                                                               (base) 

      cat /proc/meminfo | grep -i mem

               total        used        free      shared  buff/cache   available

Mem:            31Gi        17Gi       1,5Gi        49Mi        12Gi        13Gi

Swap:          4,0Gi       2,7Gi       1,3Gi

MemTotal:       32753744 kB

MemFree:         1561404 kB

MemAvailable:   14226864 kB

Shmem:             50440 kB

ShmemHugePages:        0 kB

ShmemPmdMapped:        0 kB


~ ❯❯❯ sudo cat /proc/slabinfo | grep -i mem                                 (base) 

memdup_user-8k         0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0

memdup_user-4k         0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0

memdup_user-2k       272    272   2048   16    8 : tunables    0    0    0 : slabdata     17     17      0

memdup_user-1k       544    544   1024   32    8 : tunables    0    0    0 : slabdata     17     17      0

memdup_user-512        0      0    512   32    4 : tunables    0    0    0 : slabdata      0      0      0

memdup_user-256        0      0    256   32    2 : tunables    0    0    0 : slabdata      0      0      0

memdup_user-128      480    480    128   32    1 : tunables    0    0    0 : slabdata     15     15      0

memdup_user-64      1728   1728     64   64    1 : tunables    0    0    0 : slabdata     27     27      0

memdup_user-32      4096   4096     32  128    1 : tunables    0    0    0 : slabdata     32     32      0

memdup_user-16      8192   8192     16  256    1 : tunables    0    0    0 : slabdata     32     32      0

memdup_user-8      16384  16384      8  512    1 : tunables    0    0    0 : slabdata     32     32      0

memdup_user-192      420    420    192   42    2 : tunables    0    0    0 : slabdata     10     10      0

memdup_user-96       882    882     96   42    1 : tunables    0    0    0 : slabdata     21     21      0

shmem_inode_cache   5490   5658    784   41    8 : tunables    0    0    0 : slabdata    138    138      0

kmem_cache_node     1344   1344     64   64    1 : tunables    0    0    0 : slabdata     21     21      0

kmem_cache           896    896    256   32    2 : tunables    0    0    0 : slabdata     28     28      0


~ ❯❯❯ ipcs -m                                                               (base) 



------ Shared Memory Segments --------

key        shmid      owner      perms      bytes      nattch     status      

0x00000000 32770      user       600        524288     2          dest         

0x00000000 6          user       600        67108864   2          dest         

0x00000000 9          user       600        524288     2          dest         

0x00000000 10         user       600        524288     2          dest         

0x00000000 17         user       777        8294400    2                       

0x00000000 18         user       600        4194304    2          dest         

0x00000000 32815      user       600        524288     2          dest         

0x00000000 32816      user       600        524288     2          dest         

0x00000000 32817      user       600        4194304    2          dest         

0x00000000 54         user       600        524288     2          dest         

0x00000000 57         user       600        4194304    2          dest         



~ ❯❯❯ dmesg | grep -i memory                                   (base) 

      dmesg | grep -i oom

[    0.000000] DMI: Memory slots populated: 4/4

[    0.005003] ACPI: Reserving FACP table memory at [mem 0xbc6e2000-0xbc6e2113]

[    0.005004] ACPI: Reserving DSDT table memory at [mem 0xbc6bb000-0xbc6c183a]

[    0.005005] ACPI: Reserving FACS table memory at [mem 0xbcddb000-0xbcddb03f]

[    0.005005] ACPI: Reserving SSDT table memory at [mem 0xbc6f1000-0xbc6fc7b5]

[    0.005006] ACPI: Reserving SSDT table memory at [mem 0xbc6e8000-0xbc6f0ce8]

[    0.005007] ACPI: Reserving SSDT table memory at [mem 0xbc6e4000-0xbc6e7d7b]

[    0.005008] ACPI: Reserving SSDT table memory at [mem 0xbc6e3000-0xbc6e33f0]

[    0.005008] ACPI: Reserving FIDT table memory at [mem 0xbc6db000-0xbc6db09b]

[    0.005009] ACPI: Reserving MCFG table memory at [mem 0xbc6da000-0xbc6da03b]

[    0.005010] ACPI: Reserving HPET table memory at [mem 0xbc6d9000-0xbc6d9037]

[    0.005010] ACPI: Reserving IVRS table memory at [mem 0xbc6d8000-0xbc6d80cf]

[    0.005011] ACPI: Reserving FPDT table memory at [mem 0xbc6d7000-0xbc6d7043]

[    0.005012] ACPI: Reserving TPM2 table memory at [mem 0xbc6d6000-0xbc6d604b]

[    0.005012] ACPI: Reserving PCCT table memory at [mem 0xbc6d5000-0xbc6d506d]

[    0.005013] ACPI: Reserving SSDT table memory at [mem 0xbc6cc000-0xbc6d4212]

[    0.005014] ACPI: Reserving CRAT table memory at [mem 0xbc6ca000-0xbc6cb70f]

[    0.005014] ACPI: Reserving CDIT table memory at [mem 0xbc6c9000-0xbc6c9028]

[    0.005015] ACPI: Reserving SSDT table memory at [mem 0xbc6c8000-0xbc6c8624]

[    0.005016] ACPI: Reserving SSDT table memory at [mem 0xbc6c4000-0xbc6c7a22]

[    0.005017] ACPI: Reserving WSMT table memory at [mem 0xbc6c3000-0xbc6c3027]

[    0.005017] ACPI: Reserving APIC table memory at [mem 0xbc6c2000-0xbc6c215d]

[    0.005018] ACPI: Reserving SSDT table memory at [mem 0xbc6e0000-0xbc6e147e]

[    0.005019] ACPI: Reserving SSDT table memory at [mem 0xbc6df000-0xbc6df0be]

[    0.005271] Early memory node ranges

[    0.033748] PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]

[    0.033750] PM: hibernation: Registered nosave memory: [mem 0x000a0000-0x000fffff]

[    0.033751] PM: hibernation: Registered nosave memory: [mem 0x09ccf000-0x09ffffff]

[    0.033753] PM: hibernation: Registered nosave memory: [mem 0x0a200000-0x0a20dfff]

[    0.033754] PM: hibernation: Registered nosave memory: [mem 0x0b000000-0x0b01ffff]

[    0.033756] PM: hibernation: Registered nosave memory: [mem 0xb4c70000-0xb4c70fff]

[    0.033757] PM: hibernation: Registered nosave memory: [mem 0xb4c96000-0xb4c96fff]

[    0.033758] PM: hibernation: Registered nosave memory: [mem 0xb4c97000-0xb4c97fff]

[    0.033759] PM: hibernation: Registered nosave memory: [mem 0xb4ca4000-0xb4ca4fff]

[    0.033760] PM: hibernation: Registered nosave memory: [mem 0xb8224000-0xb8224fff]

[    0.033762] PM: hibernation: Registered nosave memory: [mem 0xb9987000-0xbc596fff]

[    0.033762] PM: hibernation: Registered nosave memory: [mem 0xbc597000-0xbc6fcfff]

[    0.033763] PM: hibernation: Registered nosave memory: [mem 0xbc6fd000-0xbcdf7fff]

[    0.033763] PM: hibernation: Registered nosave memory: [mem 0xbcdf8000-0xbddfefff]

[    0.033765] PM: hibernation: Registered nosave memory: [mem 0xbf000000-0xbfffffff]

[    0.033765] PM: hibernation: Registered nosave memory: [mem 0xc0000000-0xfe9fffff]

[    0.033766] PM: hibernation: Registered nosave memory: [mem 0xfea00000-0xfea0ffff]

[    0.033766] PM: hibernation: Registered nosave memory: [mem 0xfea10000-0xfec0ffff]

[    0.033767] PM: hibernation: Registered nosave memory: [mem 0xfec10000-0xfec10fff]

[    0.033768] PM: hibernation: Registered nosave memory: [mem 0xfec11000-0xfec2ffff]

[    0.033768] PM: hibernation: Registered nosave memory: [mem 0xfec30000-0xfec30fff]

[    0.033769] PM: hibernation: Registered nosave memory: [mem 0xfec31000-0xfecfffff]

[    0.033769] PM: hibernation: Registered nosave memory: [mem 0xfed00000-0xfed00fff]

[    0.033770] PM: hibernation: Registered nosave memory: [mem 0xfed01000-0xfed3ffff]

[    0.033770] PM: hibernation: Registered nosave memory: [mem 0xfed40000-0xfed44fff]

[    0.033771] PM: hibernation: Registered nosave memory: [mem 0xfed45000-0xfed7ffff]

[    0.033771] PM: hibernation: Registered nosave memory: [mem 0xfed80000-0xfed8ffff]

[    0.033772] PM: hibernation: Registered nosave memory: [mem 0xfed90000-0xfedc1fff]

[    0.033773] PM: hibernation: Registered nosave memory: [mem 0xfedc2000-0xfedcffff]

[    0.033773] PM: hibernation: Registered nosave memory: [mem 0xfedd0000-0xfedd3fff]

[    0.033774] PM: hibernation: Registered nosave memory: [mem 0xfedd4000-0xfedd5fff]

[    0.033774] PM: hibernation: Registered nosave memory: [mem 0xfedd6000-0xffffffff]

[    0.307156] Freeing SMP alternatives memory: 48K

[    0.462709] Memory: 32629348K/33450784K available (22528K kernel code, 4521K rwdata, 14928K rodata, 5104K init, 4520K bss, 783312K reserved, 0K cma-reserved)

[    0.463459] x86/mm: Memory block size: 128MB

[    0.721578] Freeing initrd memory: 77676K

[    2.956671] Freeing unused decrypted memory: 2028K

[    2.957408] Freeing unused kernel image (initmem) memory: 5104K

[    2.957990] Freeing unused kernel image (rodata/data gap) memory: 1456K

[    6.143294] nvme nvme0: allocated 16 MiB host memory buffer.

[ 5541.248996] Cannot map memory with base addr 0x795ca4000000 and size of 0x39d280 pages



There is an issue: I killed ollama while a model was loaded, but the RAM hasn't been released, and is now unavailable (it doesn't free up when another process needs it).

So the command sudo sync && sudo echo 3 | sudo tee /proc/sys/vm/drop_caches didn't have any effect.

How can I troubleshoot further? Something is wrong here.

Gemini 2.5 Pro's response:

Excellent troubleshooting and a very interesting problem. You've provided all the right diagnostic information, and it points to a specific and common issue with applications that manage large chunks of memory, like ollama.

Let's break down the evidence.

Analysis of Your Diagnostics
drop_caches Didn't Work: This is the most important clue. The command echo 3 > /proc/sys/vm/drop_caches tells the kernel to release memory it's holding for its own performance reasons (page cache, dentries, inodes). The fact that this didn't free up a significant amount of RAM means the memory isn't being used as a cache. It is actively allocated by something.

free -h and htop Output:

Mem: 31Gi total, 17Gi used, 1.5Gi free, 12Gi buff/cache, 13Gi available
This shows that while you have 17Gi "used", a large portion of that (12Gi) is in the buff/cache. The kernel normally gives this memory back immediately if an application needs it.
The most important metric here is available: 13Gi. This is the kernel's estimate of how much memory a new process can actually get without causing swapping. This number is healthy.
However, your swap usage is high (2.7G/4.0G), which indicates that at some point, the system was under significant memory pressure.
ipcs -m (The Smoking Gun):

This command lists System V shared memory segments. These are blocks of RAM that can be shared between multiple processes.
You have several large, persistent segments. Look at these two in particular:
shmid 6: 67108864 bytes = 64 MiB
shmid 17: 8294400 bytes ≈ 7.9 MiB
shmid 18, shmid 32817, shmid 57: 4194304 bytes each = 4 MiB each
When a program like ollama is killed forcefully (kill -9) or crashes, it doesn't get a chance to run its cleanup routines. These routines are responsible for releasing shared memory segments. The result is that the segments become "orphaned"—they still exist and occupy RAM, but no active process is attached to them (or the nattch count is incorrect).
dmesg Output:

The line [ 5541.248996] Cannot map memory with base addr ... is very telling. This is a kernel-level error indicating that some process (likely related to your GPU and ollama) tried to allocate a massive, contiguous block of memory and failed. This can happen when memory is fragmented or when a driver has issues. It supports the theory that ollama was doing some heavy, low-level memory management that got interrupted.
The Problem: Orphaned Shared Memory
The evidence strongly suggests that when you killed ollama, it left behind one or more large shared memory segments. These segments are not part of the kernel's cache, so drop_caches won't touch them. They are treated as actively used memory until you manually remove them.

ollama likely uses these segments to hold the model weights in RAM for fast access, possibly sharing them between a main process and worker processes or for direct access by a GPU driver.

How to Fix It
You need to manually remove the orphaned shared memory segments using the ipcrm command.

  1. Identify the Segments to Remove

Look at the output of your ipcs -m command. The segments are identified by their shmid. You need to tell ipcrm which segment to remove. The ones associated with ollama are likely the largest ones.

  2. Remove the Segments

Use the following command format:
sudo ipcrm -m <shmid>

Based on your output, you should try removing the largest segments first.

```
# Remove the ~64 MiB segment
sudo ipcrm -m 6

# Remove the ~8.3 MB segment
sudo ipcrm -m 17

# Remove the other large segments if the memory is still not free
sudo ipcrm -m 18
sudo ipcrm -m 32817
sudo ipcrm -m 57
# ... and so on for any other suspicious segments
```

  3. Verify the Fix

After running the ipcrm commands, check your memory usage again with free -h. You should see the used memory decrease and free or available memory increase by the size of the segments you removed.
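The steps above can be sketched as a single dry run that prints a candidate ipcrm command per zero-attachment segment, assuming the util-linux `ipcs` column order (key, shmid, owner, perms, bytes, nattch, status); review the printed commands before piping them to a shell:

```shell
# Dry run: print an ipcrm command for every orphaned segment instead of
# removing shmids one by one. The $2 regex skips the header lines.
ipcs -m 2>/dev/null | awk '$2 ~ /^[0-9]+$/ && $6 == 0 { print "sudo ipcrm -m " $2 }'
```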

Further Troubleshooting and Prevention
Check GPU Memory: If ipcrm doesn't solve it, the memory might be stuck on your GPU. ollama heavily utilizes VRAM.

If you have an NVIDIA GPU: nvidia-smi
If you have an AMD GPU: rocm-smi
Look for processes that are no longer running but still have memory allocated on the GPU. A reboot is often the only way to clear this if the driver is stuck.
Use a "Gentler" Kill Signal: To prevent this from happening again, avoid kill -9 (SIGKILL). Instead, use the default kill command, which sends SIGTERM (signal 15).

kill <process_id_of_ollama>
This gives the application a chance to shut down gracefully, which includes releasing its shared memory segments.
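One possible shutdown pattern, sketched here with a hypothetical 10-second grace period (`pgrep -x` matches the exact process name; adjust for your setup):

```shell
# Ask each ollama process to exit cleanly; escalate to SIGKILL only if
# it is still alive after the grace period.
for pid in $(pgrep -x ollama); do
    kill -TERM "$pid"                                        # graceful: cleanup handlers run
    sleep 10                                                 # grace period
    kill -0 "$pid" 2>/dev/null && kill -KILL "$pid" || true  # last resort
done
```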
Check for HugePages: High-performance applications sometimes use HugePages, which are large, pre-allocated chunks of RAM that don't always show up in the standard free command's used count and are not cache.

Check for them with: grep Huge /proc/meminfo
If HugePages_Total is nonzero, that RAM is reserved for huge pages; if HugePages_Free is smaller than HugePages_Total, part of it is actively in use. Releasing the reservation usually requires lowering vm.nr_hugepages or a reboot.
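The amount of RAM actually held can be computed from those counters; here on a canned /proc/meminfo excerpt (values are made up for illustration):

```shell
# (HugePages_Total - HugePages_Free) * Hugepagesize = RAM held by in-use
# huge pages. On a real system, run the awk against /proc/meminfo.
sample='HugePages_Total:    9216
HugePages_Free:      512
Hugepagesize:       2048 kB'
echo "$sample" | awk '/HugePages_Total/ { t = $2 } /HugePages_Free/ { f = $2 } /Hugepagesize/ { sz = $2 }
    END { printf "huge pages in use: %d (%.1f GiB)\n", t - f, (t - f) * sz / 1048576 }'
```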
In summary, you've done a great job diagnosing the issue. The culprit is almost certainly the orphaned shared memory segments shown by ipcs -m. Use ipcrm to clean them up.

No but look: the segments found with ipcs -m are nowhere near the 15G still used.

You are absolutely right to point that out. My apologies. You've correctly identified that the few hundred megabytes in ipcs -m don't account for the gigabytes of memory in question. This effectively rules out System V shared memory as the primary issue.

Let's look again at your free -h output, because it holds the real answer.

```
               total        used        free      shared  buff/cache   available
Mem:            31Gi        17Gi       1,5Gi        49Mi        12Gi        13Gi
Swap:          4,0Gi       2,7Gi       1,3Gi
```
The Key is buff/cache vs. used
The core of the issue lies in the definition of "used" memory in Linux. You see 17Gi in the used column, which is alarming. However, you also see 12Gi in the buff/cache column.

Here is what that means:

used (17Gi): This is the total memory that is not free. It's a simple calculation: total - free. Crucially, this includes the buffer and cache.
buff/cache (12Gi): This is the key. This is RAM that the Linux kernel is using for disk caching (page cache) and buffering. It keeps recently accessed files in this RAM to make subsequent reads from the disk incredibly fast. This is good and normal behavior.
available (13Gi): This is the most important number. It's the kernel's estimate of how much memory is truly available for starting new applications without swapping. It is essentially free memory plus the buff/cache memory that can be reclaimed instantly.
The "missing" 12-15Gi of RAM is not actually held by a dead process. It's being used by the Linux kernel as a disk cache.

Why did drop_caches seem to fail?
This is the confusing part. The command sudo echo 3 | sudo tee /proc/sys/vm/drop_caches is supposed to clear exactly this cache. When you run it, you should see the buff/cache number plummet, and the free number shoot up by the same amount.

There are a few possibilities for why it didn't seem to work:

Observation Timing: The kernel is very aggressive about caching. If you have any disk I/O happening, the kernel might have started filling the cache again the instant you cleared it, making it look like the command had no effect.
Unreclaimable Cache: A small portion of the cache might be "dirty" (data waiting to be written to disk) or tied to a process in a specific state (like a memory-mapped file from a zombie process). The sync command before drop_caches is meant to prevent this by flushing dirty buffers, but sometimes edge cases occur.
Slab Allocations: drop_caches with a value of 3 frees page cache, dentries, and inodes. The slab memory used by kernel drivers is also part of this, but some slab objects may not be reclaimable. Looking at your slabinfo, nothing stands out as being gigabytes in size, so this is less likely.
Let's Do a Definitive Test
To confirm that this memory is just cache and is fully available, please try this sequence of commands. This will show the "before and after" state clearly.

Check memory before:

```
free -h
```
(Note the numbers in the used, free, and buff/cache columns).

Run the sync and drop caches command again:

```
sudo sync && sudo tee /proc/sys/vm/drop_caches <<< 3
```
(Using <<< is just a slightly different way of piping the input, functionally identical for this purpose).

Immediately check memory after:

```
sleep 1; free -h
```
(Note the new numbers. You should see buff/cache drop to near zero, free increase by ~12Gi, and used decrease by ~12Gi).

What to expect: You will see that the memory moves from the buff/cache column to the free column. This proves the memory was never "unavailable" to begin with; it was just being used intelligently by the kernel.

The ultimate test: Try to launch a memory-intensive application (you could even start ollama again with a model). You will see the kernel surrender the cache memory to the new application on demand. The buff/cache value will decrease as your application's memory (RES in htop) increases.


```
~ ❯❯❯ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi        19Gi       652Mi       103Mi        11Gi        11Gi
Swap:          4,0Gi       2,7Gi       1,3Gi

~ ❯❯❯ bash - "sudo sync && sudo tee /proc/sys/vm/drop_caches <<< 3"
bash: sudo sync && sudo tee /proc/sys/vm/drop_caches <<< 3: No such file or directory

~ ❯❯❯ bash -c "sudo sync && sudo tee /proc/sys/vm/drop_caches <<< 3"
3
~ ❯❯❯ sleep 1; free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi        19Gi        11Gi       113Mi       1,1Gi        11Gi
Swap:          4,0Gi       2,7Gi       1,3Gi
```
Starting ollama again and trying to run another model indeed fails with OOM. It can't use the "used" RAM.

This is an excellent and crucial result. Thank you for running that test. You have 100% confirmed that this is not a simple caching issue and there is a genuine problem.

Let's break down the "after" free -h output, because it's incredibly revealing:

```
               total        used        free      shared  buff/cache   available
Mem:            31Gi        19Gi        11Gi       113Mi       1,1Gi        11Gi
```
buff/cache Dropped: You can see that buff/cache went from 11Gi down to 1.1Gi. This means the drop_caches command did work. It successfully freed ~10Gi of disk cache.
free Increased: The free memory went from 652Mi up to 11Gi. This is where the cleared cache went.
used Did Not Change: This is the most important clue. The used memory stayed at a whopping 19Gi.
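The arithmetic from the two free -h runs backs this up: the cache released and the free memory gained match almost exactly, so everything reclaimable was reclaimed, and the 19Gi of used memory is something else entirely:

```shell
# Before: free 0.652 GiB, buff/cache 11 GiB. After: free 11 GiB, buff/cache 1.1 GiB.
awk 'BEGIN {
    printf "cache released: %.1f GiB\n", 11 - 1.1
    printf "free gained:    %.1f GiB\n", 11 - 0.652
}'
```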
Your conclusion is exactly right: "It can't use the used RAM."

New Diagnosis: Hidden Memory Allocation (Kernel or Driver Leak)
We have now proven that of your 19Gi of used RAM:

Only ~1.1Gi is disk cache.
The remaining ~18Gi is actively held by something that is not a normal user-space process visible in htop.
This points strongly to memory being allocated and held at the kernel level. When ollama was killed, it seems its corresponding kernel-side allocations, likely within a driver, were not released.

The number one suspect is the GPU driver.

Modern AI frameworks like ollama don't just use the GPU's own VRAM; they ask the GPU driver to "pin" or "map" large amounts of system RAM for high-speed transfers to the GPU. When the user-space process dies unexpectedly, this pinned memory can get "leaked" or orphaned within the driver. It won't be released until the driver itself is reset.
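If the driver accounts pinned pages under the kernel's unevictable counters (not guaranteed; driver-allocated pinned host memory is sometimes invisible even there), they can be inspected via /proc/meminfo. A canned example with hypothetical values:

```shell
# Unevictable/Mlocked pages cannot be reclaimed or swapped out. On a
# real system: grep -E '^(Unevictable|Mlocked):' /proc/meminfo
sample='Unevictable:    18874368 kB
Mlocked:        18874368 kB'
echo "$sample" | awk '{ printf "%s %.1f GiB\n", $1, $2 / 1048576 }'
```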

How to Troubleshoot Further (The Real Culprit)
Let's find exactly where that memory is.

Step 1: Check GPU Driver Memory (Highest Priority)
This is the most likely place to find the missing gigabytes. Please run the command corresponding to your GPU vendor.

For NVIDIA GPUs:

```
nvidia-smi
```

For AMD GPUs:

```
rocm-smi
```
Look at the memory usage table in the output. I suspect you will see a massive amount of memory reported as used, even though there are no applications listed in the "Processes" section. This would be definitive proof of a driver-level memory leak.

Step 2: Check for Kernel Slab Allocations with slabtop
If nvidia-smi or rocm-smi don't show anything obvious, the next step is to check the kernel's own memory pools (the "slab").

Run the following command for about 10-15 seconds, then press q to quit. It sorts by the most memory-intensive objects.

```
sudo slabtop -s c
```
Look for any entry at the top of the list that is consuming an unusual amount of memory (multiple gigabytes). The name of the object (e.g., something with dma, iommu, nvidia, amdgpu in the name) can tell us which driver is responsible.

Step 3: Hunt for Zombie Processes
This is less likely but possible. A zombie process (Z state) can sometimes hold onto resources.

```
ps aux | grep ' Z'
```
If you see any processes listed, it could be a contributing factor.
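Grepping `ps aux` for ' Z' is fragile (it can match command arguments and the grep process itself). A tighter variant, sketched here, matches the STAT column directly:

```shell
# Print only processes whose state field begins with Z (zombie).
ps -eo stat=,pid=,comm= | awk '$1 ~ /^Z/ { print "zombie:", $2, $3 }'
```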

The Solution
Regardless of what the diagnostics above show, the underlying problem is a resource leak caused by the improper termination of ollama.

Immediate Fix (Recommended): A reboot is the cleanest, safest, and most guaranteed way to fix this. It will force the kernel and all its drivers to re-initialize from a clean state, releasing all orphaned memory.

Advanced Fix (Use with Caution): If you can confirm it's a GPU driver leak (e.g., via nvidia-smi), you could try to unload and reload the kernel module for your GPU. This is complex and can destabilize your system, requiring a reboot anyway. A reboot is far more reliable.

Please post the output of nvidia-smi (or rocm-smi) and slabtop. This will give us the final piece of the puzzle.


```
~ ❯❯❯ nvidia-smi
Fri Jun 20 19:48:20 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:06:00.0  On |                  N/A |
|  0%   54C    P5             48W /  380W |     397MiB /  24576MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            1594      G   /usr/lib/xorg/Xorg                      153MiB |
|    0   N/A  N/A            2496      G   cinnamon                                 59MiB |
|    0   N/A  N/A            7223      G   ...1873736b243b2831076d864fd0435         34MiB |
|    0   N/A  N/A           17766      G   /usr/share/codium/codium                 44MiB |
|    0   N/A  N/A           24901      G   ...led --variations-seed-version         50MiB |
```

--> nope


```
~ ❯❯❯ sudo slabtop -s c

 Active / Total Objects (% used)    : 1656679 / 1889888 (87,7%)
 Active / Total Slabs (% used)      : 34548 / 34548 (100,0%)
 Active / Total Caches (% used)     : 359 / 460 (78,0%)
 Active / Total Size (% used)       : 307307,50K / 390599,47K (78,7%)
 Minimum / Average / Maximum Object : 0,01K / 0,21K / 30,52K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
 73108  20213  27%    0,57K   2611       28     41776K radix_tree_node
  2505   2493  99%   10,25K    835        3     26720K task_struct
115676 114072  98%    0,18K   2629       44     21032K vm_area_struct
 16821   4672  27%    1,15K    623       27     19936K ext4_inode_cache
 38240  10742  28%    0,50K   1195       32     19120K kmalloc-512
 70686  20943  29%    0,19K   1683       42     13464K dentry
 75990  75990 100%    0,13K   2533       30     10132K kernfs_node_cache
  9856   9688  98%    1,00K    308       32      9856K iommu_iova_magazine
 13107   9411  71%    0,62K    257       51      8224K inode_cache
 12480  12322  98%    0,50K    390       32      6240K kmalloc-rnd-05-512
 58800  21770  37%    0,09K   1400       42      5600K trace_event_file
 21632   6532  30%    0,25K    676       32      5408K kmalloc-256
  6854   4307  62%    0,69K    149       46      4768K proc_inode_cache
 74560  72560  97%    0,06K   1165       64      4660K anon_vma_chain
117198 114623  97%    0,04K   1149      102      4596K vma_lock
  4576   4165  91%    1,00K    143       32      4576K kmalloc-rnd-04-1k
  5658   5483  96%    0,77K    138       41      4416K shmem_inode_cache
 40716  40090  98%    0,10K   1044       39      4176K anon_vma
 15968  15521  97%    0,25K    499       32      3992K filp
 54400  49227  90%    0,06K    850       64      3400K dmaengine-unmap-2
 12864  11684  90%    0,25K    402       32      3216K maple_node
  1584   1535  96%    2,00K     99       16      3168K kmalloc-cg-2k
   728    704  96%    4,00K     91        8      2912K biovec-max
   344    321  93%    8,00K     86        4      2752K kmalloc-rnd-04-8k
   672    650  96%    4,00K     84        8      2688K kmalloc-cg-4k
   664    640  96%    4,00K     83        8      2656K kmalloc-rnd-06-4k
  1264   1214  96%    2,00K     79       16      2528K kmalloc-rnd-04-2k
   155    143  92%   14,17K     78        2      2496K uvm_va_range_external_t
  4960   4631  93%    0,50K    155       32      2480K kmalloc-rnd-04-512
  4736   4608  97%    0,50K    148       32      2368K pool_workqueue
  2304   2304 100%    1,00K     72       32      2304K kmalloc-cg-1k
 32032  28484  88%    0,07K    572       56      2288K vmap_area
  1035   1035 100%    2,06K     69       15      2208K sighand_cache
  3082   2944  95%    0,69K     67       46      2144K skbuff_small_head
   268    268 100%    8,00K     67        4      2144K kmalloc-rnd-10-8k
  2574   2574 100%    0,81K     66       39      2112K sock_inode_cache
  1024   1024 100%    2,00K     64       16      2048K kmalloc-rnd-12-2k
  2048   2042  99%    1,00K     64       32      2048K kmalloc-1k
  1624   1624 100%    1,12K     58       28      1856K signal_cache
  1824   1824 100%    1,00K     57       32      1824K PING
  1792   1728  96%    1,00K     56       32      1792K kmalloc-rnd-08-1k
   440    440 100%    4,00K     55        8      1760K kmalloc-rnd-12-4k
   440    399  90%    4,00K     55        8      1760K kmalloc-rnd-04-4k
  9504   9504 100%    0,18K    216       44      1728K ext4_groupinfo_4k
```

--> nope


```
~ ❯❯❯ ps aux | grep ' Z'
user       26721  0.0  0.0   9284  2064 pts/4    S+   19:50   0:00 grep --color=auto  Z
```

--> nope

This is absolutely fascinating. Your diagnostic work has been perfect, and the results have allowed us to eliminate every single one of the usual suspects.

Let's recap what we now know for certain:

It's NOT Disk Cache: Your drop_caches test proved this. You cleared ~10GB of cache, but the used memory stayed stubbornly high at 19GB, and the system still hit OOM when loading a model.
It's NOT GPU VRAM: Your nvidia-smi output is crystal clear. The GPU's own 24GB of memory is almost completely free (397MiB used).
It's NOT a Kernel Slab Leak: Your slabtop output shows total slab usage is only a few hundred megabytes, orders of magnitude less than the ~18GB of memory we are hunting for.
It's NOT Zombie Processes: Your ps command showed no zombies.
You have methodically and successfully cornered a very rare and difficult-to-diagnose type of memory leak.

Final Diagnosis: Orphaned Pinned System Memory
With all other possibilities exhausted, we are left with the most likely culprit: A leak of "pinned" system RAM within the NVIDIA driver itself.

Here's what that means:

To achieve the highest performance, ollama asks the NVIDIA driver to take a large chunk of your regular system RAM (your 32GB) and "pin" it.
Pinning memory locks it in place so the GPU can access it directly for high-speed data transfers (DMA). This is a core feature of CUDA and is essential for performance.
This pinned RAM is not part of the GPU's VRAM, so it doesn't show up in nvidia-smi. It is not a standard kernel cache, so slabtop doesn't see it and drop_caches can't touch it.
When you killed ollama, the user-space process vanished, but a bug or an unhandled state within the NVIDIA kernel driver caused it to fail to un-pin those gigabytes of system RAM.
The memory is now in a black hole: the kernel knows it's allocated and won't give it to other processes, but it doesn't belong to any running process, so no tool can account for it. It is, for all practical purposes, leaked by the driver until that driver is reset.
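One way to see this "black hole" numerically (a rough cross-check, since shared pages are double-counted) is to sum the RSS of every live process and compare it with the kernel's used figure; a multi-GiB gap is memory no process owns:

```shell
# Total resident set size of all live processes, in GiB. Compare against
# the "used" column of free -h; a large gap points at kernel/driver-held RAM.
ps -eo rss= | awk '{ sum += $1 } END { printf "sum of process RSS: %.2f GiB\n", sum / 1048576 }'
```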

The Solution: Reboot
You have done all the troubleshooting that can be reasonably done on a live system. You have proven where the problem isn't, and in doing so, have isolated it to a driver-state issue.

The necessary and correct solution now is to reboot the system.

A reboot is the only way to guarantee that the kernel completely discards the faulty memory state and forces the NVIDIA driver to be reloaded from a clean slate. This will release the orphaned pinned memory and return your system to a healthy state.

This was an excellent deep-dive into Linux memory management. The key takeaway for the future is that forcefully killing GPU-intensive applications with kill -9 carries a risk of leaving the driver in an inconsistent state, and a graceful shutdown should always be preferred if possible.

@rick-github In any case, thanks for all your suggestions up to now.
Maybe before trying to dig even deeper (though I don't know what to try next), the best step now would be to at least find out whether this is reproducible by anyone else. The model I ended up trying was unsloth/Kimi-Dev-72B-GGUF:Q3_K_M

<!-- gh-comment-id:2992552069 --> @blakkd commented on GitHub (Jun 20, 2025): I tried getting help from Qwen3 32B, AM-thinking, and to finish dumbly following Gemini 2.5 pro instructions and: no clue, all end up concluding the only solution is to reboot. Here are all the commands I tried along with my prompts and responses, I tried to format it well to relieve the a pain to read as much as possible, cause I'm conscious it's a huge message... It was during my last attempt, using Gemini 2.5 pro. All the commands have been ran after CTRL-C to SIGINT ollama when the model was successfully loaded and with 48/81 layers on VRAM so 33/81 on RAM (later tried with 50/81 VRAM - 31/81 RAM and to finish 53/81 VRAM - 28/81 RAM which lead to a `cudaMalloc failed: out of memory`, same: VRAM released, but RAM only partially). Also, a precision: loading Qwen3-32B the same way: split on RAM/VRAM, or either entirely on RAM, then SIGINT ollama doesn't triggers this RAM not released entirely behavior that is occuring with Kimi-dev. I actually never had any issue around with any model apart with this specific one. 
``` ~ ❯❯❯ sudo sync && sudo echo 3 | sudo tee /proc/sys/vm/drop_caches [sudo] password for user:            3 ``` ```     0[|0.6%]  4[ 0.0%]  8[|3.6%] 12[|0.6%]  16[ 0.0%]  20[ 0.0%] 24[ 0.0%] 28[ 0.0%]     1[|4.9%]  5[ 0.0%]  9[|0.6%] 13[ 0.0%]  17[|6.1%]  21[ 0.0%] 25[ 0.0%] 29[|0.6%]     2[|9.8%]  6[ 0.0%] 10[|0.6%] 14[ 0.0%]  18[ 0.0%]  22[ 0.0%] 26[ 0.0%] 30[|0.6%]     3[|1.2%]  7[|0.6%] 11[|0.6%] 15[ 0.0%]  19[ 0.0%]  23[ 0.0%] 27[|0.6%] 31[|1.2%]   Mem[||||||||||||||||||||||||17.2G/31.2G] Tasks: 174, 1484 thr, 373 kthr; 1 running   Swp[||||||||||||||||||||||||2.65G/4.00G] Load average: 0.37 0.53 0.63                                             Uptime: 01:54:32   [Main] [I/O]     PID USER       PRI  NI  VIRT   RES   SHR S  CPU% MEM%▽  TIME+  Command    7355 user        20   0 1392G  329M  111M S   0.0  1.0  1:56.50 /opt/brave.com/brav    7357 user        20   0 1392G  329M  111M S   0.0  1.0  0:00.01 /opt/brave.com/brav    7358 user        20   0 1392G  329M  111M S   0.0  1.0  0:48.45 /opt/brave.com/brav    7359 user        20   0 1392G  329M  111M S   0.0  1.0  0:06.39 /opt/brave.com/brav    7360 user        20   0 1392G  329M  111M S   0.0  1.0  0:10.91 /opt/brave.com/brav    7361 user        20   0 1392G  329M  111M S   0.0  1.0  0:31.34 /opt/brave.com/brav    7362 user        20   0 1392G  329M  111M S   0.0  1.0  0:00.00 /opt/brave.com/brav    7363 user        20   0 1392G  329M  111M S   0.0  1.0  0:00.00 /opt/brave.com/brav    7380 user        20   0 1392G  329M  111M S   0.0  1.0  0:05.55 /opt/brave.com/brav    7381 user        20   0 1392G  329M  111M S   0.0  1.0  0:05.29 /opt/brave.com/brav    7382 user        20   0 1392G  329M  111M S   0.0  1.0  0:05.18 /opt/brave.com/brav    7478 user        20   0 1392G  329M  111M S   0.0  1.0  0:05.76 /opt/brave.com/brav    7479 user        20   0 1392G  329M  111M S   0.0  1.0  0:03.98 /opt/brave.com/brav    7480 user        20   0 1392G  329M  111M S   0.0  1.0  0:00.56 /opt/brave.com/brav    7481 user    
    20   0 1392G  329M  111M S   0.0  1.0  0:02.90 /opt/brave.com/brav    7482 user        20   0 1392G  329M  111M S   0.0  1.0  0:01.33 /opt/brave.com/brav    7483 user        20   0 1392G  329M  111M S   0.0  1.0  0:00.71 /opt/brave.com/brav    7561 user        20   0 1392G  329M  111M S   0.0  1.0  0:00.49 /opt/brave.com/brav ``` ``` ~ ❯❯❯ free -h                                                               (base)        cat /proc/meminfo | grep -i mem                total        used        free      shared  buff/cache   available Mem:            31Gi        17Gi       1,5Gi        49Mi        12Gi        13Gi Swap:          4,0Gi       2,7Gi       1,3Gi MemTotal:       32753744 kB MemFree:         1561404 kB MemAvailable:   14226864 kB Shmem:             50440 kB ShmemHugePages:        0 kB ShmemPmdMapped:        0 kB ``` ``` ~ ❯❯❯ sudo cat /proc/slabinfo | grep -i mem                                 (base)  memdup_user-8k         0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0 memdup_user-4k         0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0 memdup_user-2k       272    272   2048   16    8 : tunables    0    0    0 : slabdata     17     17      0 memdup_user-1k       544    544   1024   32    8 : tunables    0    0    0 : slabdata     17     17      0 memdup_user-512        0      0    512   32    4 : tunables    0    0    0 : slabdata      0      0      0 memdup_user-256        0      0    256   32    2 : tunables    0    0    0 : slabdata      0      0      0 memdup_user-128      480    480    128   32    1 : tunables    0    0    0 : slabdata     15     15      0 memdup_user-64      1728   1728     64   64    1 : tunables    0    0    0 : slabdata     27     27      0 memdup_user-32      4096   4096     32  128    1 : tunables    0    0    0 : slabdata     32     32      0 memdup_user-16      8192   8192     16  256    1 : tunables    0    0    0 : slabdata     32     32      0 
memdup_user-8      16384  16384      8  512    1 : tunables    0    0    0 : slabdata     32     32      0 memdup_user-192      420    420    192   42    2 : tunables    0    0    0 : slabdata     10     10      0 memdup_user-96       882    882     96   42    1 : tunables    0    0    0 : slabdata     21     21      0 shmem_inode_cache   5490   5658    784   41    8 : tunables    0    0    0 : slabdata    138    138      0 kmem_cache_node     1344   1344     64   64    1 : tunables    0    0    0 : slabdata     21     21      0 kmem_cache           896    896    256   32    2 : tunables    0    0    0 : slabdata     28     28      0 ``` ``` ~ ❯❯❯ ipcs -m                                                               (base)  ------ Shared Memory Segments -------- key        shmid      owner      perms      bytes      nattch     status       0x00000000 32770      user       600        524288     2          dest          0x00000000 6          user       600        67108864   2          dest          0x00000000 9          user       600        524288     2          dest          0x00000000 10         user       600        524288     2          dest          0x00000000 17         user       777        8294400    2                        0x00000000 18         user       600        4194304    2          dest          0x00000000 32815      user       600        524288     2          dest          0x00000000 32816      user       600        524288     2          dest          0x00000000 32817      user       600        4194304    2          dest          0x00000000 54         user       600        524288     2          dest          0x00000000 57         user       600        4194304    2          dest          ``` ``` ~ ❯❯❯ dmesg | grep -i memory                                   (base)        dmesg | grep -i oom [    0.000000] DMI: Memory slots populated: 4/4 [    0.005003] ACPI: Reserving FACP table memory at [mem 0xbc6e2000-0xbc6e2113] [    0.005004] ACPI: Reserving 
DSDT table memory at [mem 0xbc6bb000-0xbc6c183a] [    0.005005] ACPI: Reserving FACS table memory at [mem 0xbcddb000-0xbcddb03f] [    0.005005] ACPI: Reserving SSDT table memory at [mem 0xbc6f1000-0xbc6fc7b5] [    0.005006] ACPI: Reserving SSDT table memory at [mem 0xbc6e8000-0xbc6f0ce8] [    0.005007] ACPI: Reserving SSDT table memory at [mem 0xbc6e4000-0xbc6e7d7b] [    0.005008] ACPI: Reserving SSDT table memory at [mem 0xbc6e3000-0xbc6e33f0] [    0.005008] ACPI: Reserving FIDT table memory at [mem 0xbc6db000-0xbc6db09b] [    0.005009] ACPI: Reserving MCFG table memory at [mem 0xbc6da000-0xbc6da03b] [    0.005010] ACPI: Reserving HPET table memory at [mem 0xbc6d9000-0xbc6d9037] [    0.005010] ACPI: Reserving IVRS table memory at [mem 0xbc6d8000-0xbc6d80cf] [    0.005011] ACPI: Reserving FPDT table memory at [mem 0xbc6d7000-0xbc6d7043] [    0.005012] ACPI: Reserving TPM2 table memory at [mem 0xbc6d6000-0xbc6d604b] [    0.005012] ACPI: Reserving PCCT table memory at [mem 0xbc6d5000-0xbc6d506d] [    0.005013] ACPI: Reserving SSDT table memory at [mem 0xbc6cc000-0xbc6d4212] [    0.005014] ACPI: Reserving CRAT table memory at [mem 0xbc6ca000-0xbc6cb70f] [    0.005014] ACPI: Reserving CDIT table memory at [mem 0xbc6c9000-0xbc6c9028] [    0.005015] ACPI: Reserving SSDT table memory at [mem 0xbc6c8000-0xbc6c8624] [    0.005016] ACPI: Reserving SSDT table memory at [mem 0xbc6c4000-0xbc6c7a22] [    0.005017] ACPI: Reserving WSMT table memory at [mem 0xbc6c3000-0xbc6c3027] [    0.005017] ACPI: Reserving APIC table memory at [mem 0xbc6c2000-0xbc6c215d] [    0.005018] ACPI: Reserving SSDT table memory at [mem 0xbc6e0000-0xbc6e147e] [    0.005019] ACPI: Reserving SSDT table memory at [mem 0xbc6df000-0xbc6df0be] [    0.005271] Early memory node ranges [    0.033748] PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff] [    0.033750] PM: hibernation: Registered nosave memory: [mem 0x000a0000-0x000fffff] [    0.033751] PM: hibernation: Registered nosave memory: 
[mem 0x09ccf000-0x09ffffff] [    0.033753] PM: hibernation: Registered nosave memory: [mem 0x0a200000-0x0a20dfff] [    0.033754] PM: hibernation: Registered nosave memory: [mem 0x0b000000-0x0b01ffff] [    0.033756] PM: hibernation: Registered nosave memory: [mem 0xb4c70000-0xb4c70fff] [    0.033757] PM: hibernation: Registered nosave memory: [mem 0xb4c96000-0xb4c96fff] [    0.033758] PM: hibernation: Registered nosave memory: [mem 0xb4c97000-0xb4c97fff] [    0.033759] PM: hibernation: Registered nosave memory: [mem 0xb4ca4000-0xb4ca4fff] [    0.033760] PM: hibernation: Registered nosave memory: [mem 0xb8224000-0xb8224fff] [    0.033762] PM: hibernation: Registered nosave memory: [mem 0xb9987000-0xbc596fff] [    0.033762] PM: hibernation: Registered nosave memory: [mem 0xbc597000-0xbc6fcfff] [    0.033763] PM: hibernation: Registered nosave memory: [mem 0xbc6fd000-0xbcdf7fff] [    0.033763] PM: hibernation: Registered nosave memory: [mem 0xbcdf8000-0xbddfefff] [    0.033765] PM: hibernation: Registered nosave memory: [mem 0xbf000000-0xbfffffff] [    0.033765] PM: hibernation: Registered nosave memory: [mem 0xc0000000-0xfe9fffff] [    0.033766] PM: hibernation: Registered nosave memory: [mem 0xfea00000-0xfea0ffff] [    0.033766] PM: hibernation: Registered nosave memory: [mem 0xfea10000-0xfec0ffff] [    0.033767] PM: hibernation: Registered nosave memory: [mem 0xfec10000-0xfec10fff] [    0.033768] PM: hibernation: Registered nosave memory: [mem 0xfec11000-0xfec2ffff] [    0.033768] PM: hibernation: Registered nosave memory: [mem 0xfec30000-0xfec30fff] [    0.033769] PM: hibernation: Registered nosave memory: [mem 0xfec31000-0xfecfffff] [    0.033769] PM: hibernation: Registered nosave memory: [mem 0xfed00000-0xfed00fff] [    0.033770] PM: hibernation: Registered nosave memory: [mem 0xfed01000-0xfed3ffff] [    0.033770] PM: hibernation: Registered nosave memory: [mem 0xfed40000-0xfed44fff] [    0.033771] PM: hibernation: Registered nosave memory: [mem 
0xfed45000-0xfed7ffff] [    0.033771] PM: hibernation: Registered nosave memory: [mem 0xfed80000-0xfed8ffff] [    0.033772] PM: hibernation: Registered nosave memory: [mem 0xfed90000-0xfedc1fff] [    0.033773] PM: hibernation: Registered nosave memory: [mem 0xfedc2000-0xfedcffff] [    0.033773] PM: hibernation: Registered nosave memory: [mem 0xfedd0000-0xfedd3fff] [    0.033774] PM: hibernation: Registered nosave memory: [mem 0xfedd4000-0xfedd5fff] [    0.033774] PM: hibernation: Registered nosave memory: [mem 0xfedd6000-0xffffffff] [    0.307156] Freeing SMP alternatives memory: 48K [    0.462709] Memory: 32629348K/33450784K available (22528K kernel code, 4521K rwdata, 14928K rodata, 5104K init, 4520K bss, 783312K reserved, 0K cma-reserved) [    0.463459] x86/mm: Memory block size: 128MB [    0.721578] Freeing initrd memory: 77676K [    2.956671] Freeing unused decrypted memory: 2028K [    2.957408] Freeing unused kernel image (initmem) memory: 5104K [    2.957990] Freeing unused kernel image (rodata/data gap) memory: 1456K [    6.143294] nvme nvme0: allocated 16 MiB host memory buffer. [ 5541.248996] Cannot map memory with base addr 0x795ca4000000 and size of 0x39d280 pages ``` **_There is an issue: I killed ollama while a model was loaded, but the RAM hasn't been released, and is now unavailable (it doesn't free up when another process needs it)._** **_So the command `sudo sync && sudo echo 3 | sudo tee /proc/sys/vm/drop_caches` didn't have any effect._** **_How to troubleshoot further, something is wrong here._** > Excellent troubleshooting and a very interesting problem. You've provided all the right diagnostic information, and it points to a specific and common issue when dealing with applications that manage large chunks of memory, like ollama. > > Let's break down the evidence. > > Analysis of Your Diagnostics > drop_caches Didn't Work: This is the most important clue. 
>    The command `echo 3 > /proc/sys/vm/drop_caches` tells the kernel to release memory it's holding for its own performance reasons (page cache, dentries, inodes). The fact that this didn't free up a significant amount of RAM means the memory isn't being used as a cache. It is actively allocated by something.
>
> 2. **`free -h` and `htop` Output:**
>
>    Mem: 31Gi total, 17Gi used, 1.5Gi free, 12Gi buff/cache, 13Gi available
>
>    This shows that while you have 17Gi "used", a large portion of that (12Gi) is in the buff/cache. The kernel normally gives this memory back immediately if an application needs it.
>    The most important metric here is available: 13Gi. This is the kernel's estimate of how much memory a new process can actually get without causing swapping. This number is healthy.
>    However, your swap usage is high (2.7G/4.0G), which indicates that at some point, the system was under significant memory pressure.
>
> 3. **`ipcs -m` (The Smoking Gun):**
>
>    This command lists System V shared memory segments. These are blocks of RAM that can be shared between multiple processes.
>    You have several large, persistent segments. Look at these in particular:
>    - shmid 6: 67108864 bytes ≈ 64 MiB
>    - shmid 17: 8294400 bytes ≈ 8.3 MB
>    - shmid 18, shmid 32817, shmid 57: 4194304 bytes each ≈ 4.2 MB each
>
>    When a program like ollama is killed forcefully (`kill -9`) or crashes, it doesn't get a chance to run its cleanup routines. These routines are responsible for releasing shared memory segments. The result is that the segments become "orphaned"—they still exist and occupy RAM, but no active process is attached to them (or the nattch count is incorrect).
>
> 4. **`dmesg` Output:**
>
>    The line `[ 5541.248996] Cannot map memory with base addr ...` is very telling. This is a kernel-level error indicating that some process (likely related to your GPU and ollama) tried to allocate a massive, contiguous block of memory and failed. This can happen when memory is fragmented or when a driver has issues.
>    It supports the theory that ollama was doing some heavy, low-level memory management that got interrupted.
>
> **The Problem: Orphaned Shared Memory**
>
> The evidence strongly suggests that when you killed ollama, it left behind one or more large shared memory segments. These segments are not part of the kernel's cache, so drop_caches won't touch them. They are treated as actively used memory until you manually remove them.
>
> ollama likely uses these segments to hold the model weights in RAM for fast access, possibly sharing them between a main process and worker processes or for direct access by a GPU driver.
>
> **How to Fix It**
>
> You need to manually remove the orphaned shared memory segments using the `ipcrm` command.
>
> **1. Identify the Segments to Remove**
>
> Look at the output of your `ipcs -m` command. The segments are identified by their shmid. You need to tell `ipcrm` which segment to remove. The ones associated with ollama are likely the largest ones.
>
> **2. Remove the Segments**
>
> Use the following command format: `sudo ipcrm -m <shmid>`
>
> Based on your output, you should try removing the largest segments first.
>
> **Remove the ~64MB segment:** `sudo ipcrm -m 6`
>
> **Remove the ~8.3MB segment:** `sudo ipcrm -m 17`
>
> **Remove the other large segments if the memory is still not free:** `sudo ipcrm -m 18`, `sudo ipcrm -m 32817`, `sudo ipcrm -m 57`
>
> **... and so on for any other suspicious segments.**
>
> **3. Verify the Fix**
>
> After running the ipcrm commands, check your memory usage again with `free -h`. You should see the used memory decrease and free or available memory increase by the size of the segments you removed.
>
> **Further Troubleshooting and Prevention**
>
> **Check GPU Memory:** If ipcrm doesn't solve it, the memory might be stuck on your GPU. ollama heavily utilizes VRAM.
>
> - If you have an NVIDIA GPU: `nvidia-smi`
> - If you have an AMD GPU: `rocm-smi`
>
> Look for processes that are no longer running but still have memory allocated on the GPU.
> A reboot is often the only way to clear this if the driver is stuck.
>
> **Use a "Gentler" Kill Signal:** To prevent this from happening again, avoid using `kill -9` (SIGKILL). Instead, use the default kill command, which sends SIGTERM (signal 15):
>
> `kill <process_id_of_ollama>`
>
> This gives the application a chance to shut down gracefully, which includes releasing its shared memory segments.
>
> **Check for HugePages:** High-performance applications sometimes use HugePages, which are large, pre-allocated chunks of RAM that don't always show up in the standard free command's used count and are not cache.
>
> Check for them with: `grep Huge /proc/meminfo`
>
> If you see a large HugePages_Total and a HugePages_Free that are not equal, memory might be tied up there. Clearing these usually requires a reboot.
>
> In summary, you've done a great job diagnosing the issue. The culprit is almost certainly the orphaned shared memory segments shown by `ipcs -m`. Use `ipcrm` to clean them up.

**_No but look: the segments found with `ipcs -m` are nowhere near the 15G still used._**

> You are absolutely right to point that out. My apologies. You've correctly identified that the few hundred megabytes in `ipcs -m` don't account for the gigabytes of memory in question. This effectively rules out System V shared memory as the primary issue.
>
> Let's look again at your `free -h` output, because it holds the real answer.
>
> |       | total | used  | free  | shared | buff/cache | available |
> |-------|-------|-------|-------|--------|------------|-----------|
> | Mem:  | 31Gi  | 17Gi  | 1,5Gi | 49Mi   | 12Gi       | 13Gi      |
> | Swap: | 4,0Gi | 2,7Gi | 1,3Gi |        |            |           |
>
> **The Key is buff/cache vs. used**
>
> The core of the issue lies in the definition of "used" memory in Linux. You see 17Gi in the used column, which is alarming. However, you also see 12Gi in the buff/cache column.
>
> Here is what that means:
>
> - **used (17Gi):** This is the total memory that is not free. It's a simple calculation: total - free. Crucially, this includes the buffer and cache.
> - **buff/cache (12Gi):** This is the key. This is RAM that the Linux kernel is using for disk caching (page cache) and buffering. It keeps recently accessed files in this RAM to make subsequent reads from the disk incredibly fast. This is good and normal behavior.
> - **available (13Gi):** This is the most important number. It's the kernel's estimate of how much memory is truly available for starting new applications without swapping. It is essentially free memory plus the buff/cache memory that can be reclaimed instantly.
>
> The "missing" 12-15Gi of RAM is not actually held by a dead process. It's being used by the Linux kernel as a disk cache.
>
> **Why did drop_caches seem to fail?**
>
> This is the confusing part. The command `sudo echo 3 | sudo tee /proc/sys/vm/drop_caches` is supposed to clear exactly this cache. When you run it, you should see the buff/cache number plummet, and the free number shoot up by the same amount.
>
> There are a few possibilities for why it didn't seem to work:
>
> - **Observation Timing:** The kernel is very aggressive about caching. If you have any disk I/O happening, the kernel might have started filling the cache again the instant you cleared it, making it look like the command had no effect.
> - **Unreclaimable Cache:** A small portion of the cache might be "dirty" (data waiting to be written to disk) or tied to a process in a specific state (like a memory-mapped file from a zombie process). The sync command before drop_caches is meant to prevent this by flushing dirty buffers, but sometimes edge cases occur.
> - **Slab Allocations:** drop_caches with a value of 3 frees page cache, dentries, and inodes. The slab memory used by kernel drivers is also part of this, but some slab objects may not be reclaimable. Looking at your slabinfo, nothing stands out as being gigabytes in size, so this is less likely.
>
> **Let's Do a Definitive Test**
>
> To confirm that this memory is just cache and is fully available, please try this sequence of commands. This will show the "before and after" state clearly.
> 1. **Check memory before:**
>
>    ```bash
>    free -h
>    ```
>
>    (Note the numbers in the used, free, and buff/cache columns.)
>
> 2. **Run the sync and drop caches command again:**
>
>    ```bash
>    sudo sync && sudo tee /proc/sys/vm/drop_caches <<< 3
>    ```
>
>    (Using `<<<` is just a slightly different way of piping the input, functionally identical for this purpose.)
>
> 3. **Immediately check memory after:**
>
>    ```bash
>    sleep 1; free -h
>    ```
>
>    (Note the new numbers. You should see buff/cache drop to near zero, free increase by ~12Gi, and used decrease by ~12Gi.)
>
> **What to expect:** You will see that the memory moves from the buff/cache column to the free column. This proves the memory was never "unavailable" to begin with; it was just being used intelligently by the kernel.
>
> **The ultimate test:** Try to launch a memory-intensive application (you could even start ollama again with a model). You will see the kernel surrender the cache memory to the new application on demand. The buff/cache value will decrease as your application's memory (RES in htop) increases.

```
~ ❯❯❯ free -h                                                  (base)
               total        used        free      shared  buff/cache   available
Mem:            31Gi        19Gi       652Mi       103Mi        11Gi        11Gi
Swap:          4,0Gi       2,7Gi       1,3Gi
~ ❯❯❯ bash - "sudo sync && sudo tee /proc/sys/vm/drop_caches <<< 3"
bash: sudo sync && sudo tee /proc/sys/vm/drop_caches <<< 3: No such file or directory
~ ❯❯❯ bash -c "sudo sync && sudo tee /proc/sys/vm/drop_caches <<< 3"
3
```

```
~ ❯❯❯ sleep 1; free -h                                         (base)
               total        used        free      shared  buff/cache   available
Mem:            31Gi        19Gi        11Gi       113Mi       1,1Gi        11Gi
Swap:          4,0Gi       2,7Gi       1,3Gi
```

**_Starting ollama again and trying to run another model indeed fails because of OOM. It can't use the "used" RAM._**

> This is an excellent and crucial result.
> Thank you for running that test. You have 100% confirmed that this is not a simple caching issue and there is a genuine problem.
>
> Let's break down the "after" `free -h` output, because it's incredibly revealing:
>
> |      | total | used | free | shared | buff/cache | available |
> |------|-------|------|------|--------|------------|-----------|
> | Mem: | 31Gi  | 19Gi | 11Gi | 113Mi  | 1,1Gi      | 11Gi      |
>
> - **buff/cache Dropped:** You can see that buff/cache went from 11Gi down to 1.1Gi. This means the drop_caches command did work. It successfully freed ~10Gi of disk cache.
> - **free Increased:** The free memory went from 652Mi up to 11Gi. This is where the cleared cache went.
> - **used Did Not Change:** This is the most important clue. The used memory stayed at a whopping 19Gi.
>
> Your conclusion is exactly right: "It can't use the used RAM."
>
> **New Diagnosis: Hidden Memory Allocation (Kernel or Driver Leak)**
>
> We have now proven that of your 19Gi of used RAM:
>
> - Only ~1.1Gi is disk cache.
> - The remaining ~18Gi is actively held by something that is not a normal user-space process visible in htop.
>
> This points strongly to memory being allocated and held at the kernel level. When ollama was killed, it seems its corresponding kernel-side allocations, likely within a driver, were not released.
>
> The number one suspect is the GPU driver.
>
> Modern AI frameworks like ollama don't just use the GPU's own VRAM; they ask the GPU driver to "pin" or "map" large amounts of system RAM for high-speed transfers to the GPU. When the user-space process dies unexpectedly, this pinned memory can get "leaked" or orphaned within the driver. It won't be released until the driver itself is reset.
>
> **How to Troubleshoot Further (The Real Culprit)**
>
> Let's find exactly where that memory is.
>
> **Step 1: Check GPU Driver Memory (Highest Priority)**
>
> This is the most likely place to find the missing gigabytes. Please run the command corresponding to your GPU vendor.
>
> - For NVIDIA GPUs: `nvidia-smi`
> - For AMD GPUs: `rocm-smi`
>
> Look at the memory usage table in the output. I suspect you will see a massive amount of memory reported as used, even though there are no applications listed in the "Processes" section. This would be definitive proof of a driver-level memory leak.
>
> **Step 2: Check for Kernel Slab Allocations with slabtop**
>
> If nvidia-smi or rocm-smi don't show anything obvious, the next step is to check the kernel's own memory pools (the "slab").
>
> Run the following command for about 10-15 seconds, then press q to quit. It sorts by the most memory-intensive objects: `sudo slabtop -s c`
>
> Look for any entry at the top of the list that is consuming an unusual amount of memory (multiple gigabytes). The name of the object (e.g., something with dma, iommu, nvidia, amdgpu in the name) can tell us which driver is responsible.
>
> **Step 3: Hunt for Zombie Processes**
>
> This is less likely but possible. A zombie process (Z state) can sometimes hold onto resources: `ps aux | grep ' Z'`
>
> If you see any processes listed, it could be a contributing factor.
>
> **The Solution**
>
> Regardless of what the diagnostics above show, the underlying problem is a resource leak caused by the improper termination of ollama.
>
> **Immediate Fix (Recommended):** A reboot is the cleanest, safest, and most guaranteed way to fix this. It will force the kernel and all its drivers to re-initialize from a clean state, releasing all orphaned memory.
>
> **Advanced Fix (Use with Caution):** If you can confirm it's a GPU driver leak (e.g., via nvidia-smi), you could try to unload and reload the kernel module for your GPU. This is complex and can destabilize your system, requiring a reboot anyway. A reboot is far more reliable.
>
> Please post the output of nvidia-smi (or rocm-smi) and slabtop. This will give us the final piece of the puzzle.
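As an aside, `nvidia-smi` also has a machine-readable query mode (`--query-gpu` and `--format` are documented flags), which is easier to check from scripts than the human-oriented table. A small sketch that degrades gracefully on machines without the NVIDIA driver; the helper name `gpu_mem` is made up for illustration:

```shell
# Report GPU memory use in CSV form, or note that the driver is absent.
gpu_mem() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # Documented query flags: one "used, total" line per GPU, e.g. "397 MiB, 24576 MiB"
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader
  else
    echo "nvidia-smi not found"
  fi
}
gpu_mem
```

On the system above this would have shown the same ~397 MiB of VRAM in use, confirming at a glance that the missing gigabytes are not on the GPU itself.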
```
~ ❯❯❯ nvidia-smi                                               (base)
Fri Jun 20 19:48:20 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:06:00.0  On |                  N/A |
|  0%   54C    P5             48W /  380W |     397MiB /  24576MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            1594      G   /usr/lib/xorg/Xorg                      153MiB |
|    0   N/A  N/A            2496      G   cinnamon                                 59MiB |
|    0   N/A  N/A            7223      G   ...1873736b243b2831076d864fd0435         34MiB |
|    0   N/A  N/A           17766      G   /usr/share/codium/codium                 44MiB |
|    0   N/A  N/A           24901      G   ...led --variations-seed-version         50MiB |
```

**_--> nope_**

```
~ ❯❯❯ sudo slabtop -s c
 Active / Total Objects (% used)    : 1656679 / 1889888 (87,7%)
 Active / Total Slabs (% used)      : 34548 / 34548 (100,0%)
 Active / Total Caches (% used)     : 359 / 460 (78,0%)
 Active / Total Size (% used)       : 307307,50K / 390599,47K (78,7%)
 Minimum / Average / Maximum Object : 0,01K / 0,21K / 30,52K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 73108  20213  27%    0,57K   2611       28     41776K radix_tree_node
  2505   2493  99%   10,25K    835        3     26720K task_struct
115676 114072  98%    0,18K   2629       44     21032K vm_area_struct
 16821   4672  27%    1,15K    623       27     19936K ext4_inode_cache
 38240  10742  28%    0,50K   1195       32     19120K kmalloc-512
 70686  20943  29%    0,19K   1683       42     13464K dentry
 75990  75990 100%    0,13K   2533       30     10132K kernfs_node_cache
  9856   9688  98%    1,00K    308       32      9856K iommu_iova_magazine
 13107   9411  71%    0,62K    257       51      8224K inode_cache
 12480  12322  98%    0,50K    390       32      6240K kmalloc-rnd-05-512
 58800  21770  37%    0,09K   1400       42      5600K trace_event_file
 21632   6532  30%    0,25K    676       32      5408K kmalloc-256
  6854   4307  62%    0,69K    149       46      4768K proc_inode_cache
 74560  72560  97%    0,06K   1165       64      4660K anon_vma_chain
117198 114623  97%    0,04K   1149      102      4596K vma_lock
  4576   4165  91%    1,00K    143       32      4576K kmalloc-rnd-04-1k
  5658   5483  96%    0,77K    138       41      4416K shmem_inode_cache
 40716  40090  98%    0,10K   1044       39      4176K anon_vma
 15968  15521  97%    0,25K    499       32      3992K filp
 54400  49227  90%    0,06K    850       64      3400K dmaengine-unmap-2
 12864  11684  90%    0,25K    402       32      3216K maple_node
  1584   1535  96%    2,00K     99       16      3168K kmalloc-cg-2k
   728    704  96%    4,00K     91        8      2912K biovec-max
   344    321  93%    8,00K     86        4      2752K kmalloc-rnd-04-8k
   672    650  96%    4,00K     84        8      2688K kmalloc-cg-4k
   664    640  96%    4,00K     83        8      2656K kmalloc-rnd-06-4k
  1264   1214  96%    2,00K     79       16      2528K kmalloc-rnd-04-2k
   155    143  92%   14,17K     78        2      2496K uvm_va_range_external_t
  4960   4631  93%    0,50K    155       32      2480K kmalloc-rnd-04-512
  4736   4608  97%    0,50K    148       32      2368K pool_workqueue
  2304   2304 100%    1,00K     72       32      2304K kmalloc-cg-1k
 32032  28484  88%    0,07K    572       56      2288K vmap_area
  1035   1035 100%    2,06K     69       15      2208K sighand_cache
  3082   2944  95%    0,69K     67       46      2144K skbuff_small_head
   268    268 100%    8,00K     67        4      2144K kmalloc-rnd-10-8k
  2574   2574 100%    0,81K     66       39      2112K sock_inode_cache
  1024   1024 100%    2,00K     64       16      2048K kmalloc-rnd-12-2k
  2048   2042  99%    1,00K     64       32      2048K kmalloc-1k
  1624   1624 100%    1,12K     58       28      1856K signal_cache
  1824   1824 100%    1,00K     57       32      1824K PING
  1792   1728  96%    1,00K     56       32      1792K kmalloc-rnd-08-1k
   440    440 100%    4,00K     55        8      1760K kmalloc-rnd-12-4k
   440    399  90%    4,00K     55        8      1760K kmalloc-rnd-04-4k
  9504   9504 100%    0,18K    216       44      1728K ext4_groupinfo_4k
```

**_--> nope_**

```
~ ❯❯❯ ps aux | grep ' Z'                                       (base)
user       26721  0.0  0.0   9284  2064 pts/4    S+   19:50   0:00 grep --color=auto  Z
```

**_--> nope_**

> This is absolutely fascinating. Your diagnostic work has been perfect, and the results have allowed us to eliminate every single one of the usual suspects.
>
> Let's recap what we now know for certain:
>
> - **It's NOT Disk Cache:** Your drop_caches test proved this. You cleared ~10GB of cache, but the used memory stayed stubbornly high at 19GB, and the system remained OOM.
> - **It's NOT GPU VRAM:** Your nvidia-smi output is crystal clear. The GPU's own 24GB of memory is almost completely free (397MiB used).
> - **It's NOT a Kernel Slab Leak:** Your slabtop output shows total slab usage is only a few hundred megabytes, orders of magnitude less than the ~18GB of memory we are hunting for.
> - **It's NOT Zombie Processes:** Your ps command showed no zombies.
>
> You have methodically and successfully cornered a very rare and difficult-to-diagnose type of memory leak.
>
> **Final Diagnosis: Orphaned Pinned System Memory**
>
> With all other possibilities exhausted, we are left with the most likely culprit: a leak of "pinned" system RAM within the NVIDIA driver itself.
>
> Here's what that means:
>
> - To achieve the highest performance, ollama asks the NVIDIA driver to take a large chunk of your regular system RAM (your 32GB) and "pin" it.
> - Pinning memory locks it in place so the GPU can access it directly for high-speed data transfers (DMA). This is a core feature of CUDA and is essential for performance.
> - This pinned RAM is not part of the GPU's VRAM, so it doesn't show up in nvidia-smi. It is not a standard kernel cache, so slabtop doesn't see it and drop_caches can't touch it.
> - When you killed ollama, the user-space process vanished, but a bug or an unhandled state within the NVIDIA kernel driver caused it to fail to un-pin those gigabytes of system RAM.
> - The memory is now in a black hole: the kernel knows it's allocated and won't give it to other processes, but it doesn't belong to any running process, so no tool can account for it.
> It is, for all practical purposes, leaked by the driver until that driver is reset.
>
> **The Solution: Reboot**
>
> You have done all the troubleshooting that can reasonably be done on a live system. You have proven where the problem isn't, and in doing so, have isolated it to a driver-state issue.
>
> The necessary and correct solution now is to reboot the system.
>
> A reboot is the only way to guarantee that the kernel completely discards the faulty memory state and forces the NVIDIA driver to be reloaded from a clean slate. This will release the orphaned pinned memory and return your system to a healthy state.
>
> This was an excellent deep-dive into Linux memory management. The key takeaway for the future is that forcefully killing GPU-intensive applications with kill -9 carries a risk of leaving the driver in an inconsistent state, and a graceful shutdown should always be preferred if possible.

@rick-github In any case, thanks for all your suggestions so far.

Maybe before digging even deeper (I don't know what to try next), the best step now would be to find out whether this is reproducible by anyone. The model I ended up trying was [unsloth/Kimi-Dev-72B-GGUF:Q3_K_M](https://huggingface.co/unsloth/Kimi-Dev-72B-GGUF/blob/main/Kimi-Dev-72B-Q3_K_M.gguf)
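The graceful-shutdown takeaway above can be sketched as a small helper. This is only an illustration: the function name `stop_gracefully` is made up, and the process name `ollama` is the default on Linux installs; adjust for your setup.

```shell
# Stop a process by name with SIGTERM so it can run its cleanup handlers
# (including asking the driver to un-pin host RAM), instead of SIGKILL,
# which gives it no chance to clean up.
stop_gracefully() {
  pid=$(pgrep -x "$1" | head -n 1)
  [ -n "$pid" ] || return 0      # nothing to do if the process isn't running
  kill -TERM "$pid"              # SIGTERM (15); never use kill -9 here
}

stop_gracefully ollama
# Installs that use the packaged service can instead run:
#   sudo systemctl stop ollama
```

If the process ignores SIGTERM for a long time, escalate manually rather than reaching for `-9` first.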
Author
Owner

@blakkd commented on GitHub (Jun 30, 2025):

I just noticed it's actually not specific to Kimi-Dev, but is happening with any model loaded in RAM instead of VRAM (here I tried with https://huggingface.co/WisdomShell/RewardAnything-8B-v1).
I'll open a new issue to clean this up, instead of just modifying the title because I've been really confusing all along.

![Image](https://github.com/user-attachments/assets/4ede38c0-fe87-4edf-9542-e75ed08742c6)
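For anyone trying to reproduce this: since the leaked RAM here showed up in `used` but not in cache, slab, VRAM, or any process, one quick check is to read `/proc/meminfo` directly and compute the unexplained remainder. A rough sketch (the helper name `meminfo_gap` is made up; the field names are the standard kernel ones — pinned or driver-held pages often leave a large "unaccounted" gap like this one):

```shell
# Rough accounting of RAM not explained by free memory, page cache, or slab.
# Takes an optional meminfo-format file argument, defaulting to /proc/meminfo.
meminfo_gap() {
  awk '
    /^MemTotal:/    { total  = $2 }
    /^MemFree:/     { mfree  = $2 }
    /^Buffers:/     { buf    = $2 }
    /^Cached:/      { cached = $2 }   # anchored, so SwapCached is not matched
    /^Slab:/        { slab   = $2 }
    /^Unevictable:/ { unev   = $2 }
    /^Mlocked:/     { mlock  = $2 }
    END {
      printf "Unevictable: %d kB, Mlocked: %d kB\n", unev, mlock
      printf "Unaccounted: %d kB\n", total - mfree - buf - cached - slab
    }' "${1:-/proc/meminfo}"
}

# Only meaningful on a Linux host with procfs.
if [ -r /proc/meminfo ]; then meminfo_gap; fi
```

A healthy idle system leaves only a few hundred MB unaccounted (kernel text, page tables, per-process anonymous memory); many gigabytes with no matching process RSS is consistent with the driver-held leak described above.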

<!-- gh-comment-id:3020539783 -->

Reference: github-starred/ollama#69384