[GH-ISSUE #7130] GPU VRAM Usage Timeout Warnings on Embeddings Model Load #30289

New Issue

GiteaMirror · 2026-04-22T09:49:48-05:00

GiteaMirror commented

2026-04-22 09:49:48 -05:00

Originally created by @maxruby on GitHub (Oct 8, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7130

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Description:
We are experiencing repeated GPU VRAM recovery timeouts while running multiple models on the ollama platform. The GPU in use is 2x NVIDIA RTX A5000. The system logs show that the VRAM usage does not recover within the expected timeout (5+ seconds), which affects performance and stability.

The issue occurs when loading and running embedding models, particularly when switching between different models. Below is an excerpt of the log showing the repeated warnings and the affected models:

Okt 08 12:26:37 Aerion3 ollama[104243]: llama_model_loader: - type  f32:  243 tensors
Okt 08 12:26:37 Aerion3 ollama[104243]: llama_model_loader: - type  f16:  146 tensors
Okt 08 12:31:41 Aerion3 ollama[104243]: time=2024-10-08T12:31:41.710+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.167422277 model=/usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3
Okt 08 12:31:41 Aerion3 ollama[104243]: time=2024-10-08T12:31:41.959+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.417004589 model=/usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3
Okt 08 12:31:46 Aerion3 ollama[104243]: time=2024-10-08T12:31:46.768+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.057837537 model=/usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d

Possible Causes under consideration:

Insufficient VRAM: The GPU may not have enough VRAM to efficiently load and unload multiple models, leading to delays in VRAM recovery. This seems unlikely because the nvtop never shows GPU consumption above 4% when the warning appears
Memory Fragmentation: Fragmented memory in the VRAM might be causing issues when trying to allocate new contiguous memory.
GPU Overload: The workload may be too heavy for the GPU, especially if multiple models are loaded simultaneously.
CUDA Memory Management: Inefficient management of CUDA memory offloading may be causing this issue.

System Information:

GPU: 2x NVIDIA RTX A5000
ollama Version: 0.3.12
Model in Use: jina-embeddings-v2-base-en:latest, mxbai-embed-large-v1 and other models
VRAM Available: ~24 GiB x2

Steps to Reproduce:

Load and run multiple models in parallel or sequentially.
Monitor system logs for VRAM recovery warnings as models are switched or loaded.

Expected Behavior:
The system should manage VRAM more efficiently, releasing it within the timeout to avoid warnings and improve overall performance.

Request:
Please investigate possible improvements to VRAM memory management or provide guidance on how to better configure the system to avoid these timeouts.

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.3.12

Originally created by @maxruby on GitHub (Oct 8, 2024). Original GitHub issue: https://github.com/ollama/ollama/issues/7130 Originally assigned to: @dhiltgen on GitHub. ### What is the issue? Description: We are experiencing repeated GPU VRAM recovery timeouts while running multiple models on the ollama platform. The GPU in use is 2x NVIDIA RTX A5000. The system logs show that the VRAM usage does not recover within the expected timeout (5+ seconds), which affects performance and stability. The issue occurs when loading and running embedding models, particularly when switching between different models. Below is an excerpt of the log showing the repeated warnings and the affected models: ``` Okt 08 12:26:37 Aerion3 ollama[104243]: llama_model_loader: - type f32: 243 tensors Okt 08 12:26:37 Aerion3 ollama[104243]: llama_model_loader: - type f16: 146 tensors Okt 08 12:31:41 Aerion3 ollama[104243]: time=2024-10-08T12:31:41.710+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.167422277 model=/usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3 Okt 08 12:31:41 Aerion3 ollama[104243]: time=2024-10-08T12:31:41.959+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.417004589 model=/usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3 Okt 08 12:31:46 Aerion3 ollama[104243]: time=2024-10-08T12:31:46.768+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.057837537 model=/usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d ``` Possible Causes under consideration: - Insufficient VRAM: The GPU may not have enough VRAM to efficiently load and unload multiple models, leading to delays in VRAM recovery. **This seems unlikely because the `nvtop` never shows GPU consumption above 4% when the warning appears** - Memory Fragmentation: Fragmented memory in the VRAM might be causing issues when trying to allocate new contiguous memory. - GPU Overload: The workload may be too heavy for the GPU, especially if multiple models are loaded simultaneously. - CUDA Memory Management: Inefficient management of CUDA memory offloading may be causing this issue. System Information: - GPU: 2x NVIDIA RTX A5000 - ollama Version: 0.3.12 - Model in Use: `jina-embeddings-v2-base-en:latest`, `mxbai-embed-large-v1` and other models - VRAM Available: ~24 GiB x2 Steps to Reproduce: - Load and run multiple models in parallel or sequentially. - Monitor system logs for VRAM recovery warnings as models are switched or loaded. Expected Behavior: The system should manage VRAM more efficiently, releasing it within the timeout to avoid warnings and improve overall performance. Request: Please investigate possible improvements to VRAM memory management or provide guidance on how to better configure the system to avoid these timeouts. ### OS Linux ### GPU Nvidia ### CPU AMD ### Ollama version 0.3.12

GiteaMirror added the memory bug labels 2026-04-22 09:49:48 -05:00

GiteaMirror commented

2026-04-22 09:49:50 -05:00

@rick-github commented on GitHub (Oct 8, 2024):

Please post complete logs, it makes debugging much easier.

@rick-github commented on GitHub (Oct 8, 2024): Please post complete logs, it makes debugging much easier.

GiteaMirror commented

2026-04-22 09:49:51 -05:00

@maxruby commented on GitHub (Oct 8, 2024):

@rick-github
Thanks for the quick reply. Of course, here is a log for one session which shows the warning:

Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.465+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3 gpu=GPU-17536158-810f-656a-7572-2efa837e5d0c parallel=1 available=25198395392 required="754.6 MiB"
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.465+02:00 level=INFO source=server.go:103 msg="system memory" total="251.5 GiB" free="220.0 GiB" free_swap="2.0 GiB"
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.465+02:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=13 layers.offload=13 layers.split="" memory.available="[23.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="754.6 MiB" memory.required.partial="754.6 MiB" memory.required.kv="6.0 MiB" memory.required.allocations="[754.6 MiB]" memory.weights.total="222.3 MiB" memory.weights.repeating="177.6 MiB" memory.weights.nonrepeating="44.7 MiB" memory.graph.full="12.0 MiB" memory.graph.partial="12.0 MiB"
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.468+02:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama1067081739/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --parallel 1 --port 36121"
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.469+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.469+02:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.469+02:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
Okt 08 15:39:02 Aerion3 ollama[699549]: INFO [main] build info | build=10 commit="188ed47" tid="136898111410176" timestamp=1728394742
Okt 08 15:39:02 Aerion3 ollama[699549]: INFO [main] system info | n_threads=24 n_threads_batch=24 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="136898111410176" timestamp=1728394742 total_threads=48
Okt 08 15:39:02 Aerion3 ollama[699549]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="47" port="36121" tid="136898111410176" timestamp=1728394742
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: loaded meta data with 24 key-value pairs and 196 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3 (version GGUF V3 (latest))
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   0:                       general.architecture str              = jina-bert-v2
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   1:                               general.name str              = jina-embeddings-v2-base-en
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   2:                   jina-bert-v2.block_count u32              = 12
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   3:                jina-bert-v2.context_length u32              = 8192
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   4:              jina-bert-v2.embedding_length u32              = 768
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   5:           jina-bert-v2.feed_forward_length u32              = 3072
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   6:          jina-bert-v2.attention.head_count u32              = 12
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   7:  jina-bert-v2.attention.layer_norm_epsilon f32              = 0.000000
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   8:                          general.file_type u32              = 1
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   9:              jina-bert-v2.attention.causal bool             = false
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  10:                  jina-bert-v2.pooling_type u32              = 1
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = bert
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  13:                         tokenizer.ggml.pre str              = jina-v2-en
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,30528]   = ["[PAD]", "[unused0]", "[unused1]", "...
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,30528]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  16:            tokenizer.ggml.unknown_token_id u32              = 100
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  17:          tokenizer.ggml.seperator_token_id u32              = 102
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  19:                tokenizer.ggml.cls_token_id u32              = 101
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  20:               tokenizer.ggml.mask_token_id u32              = 103
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = true
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  23:               general.quantization_version u32              = 2
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - type  f32:  111 tensors
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - type  f16:   85 tensors
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_vocab: special tokens cache size = 5
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_vocab: token to piece cache size = 0.2032 MB
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: format           = GGUF V3 (latest)
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: arch             = jina-bert-v2
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: vocab type       = WPM
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_vocab          = 30528
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_merges         = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: vocab_only       = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_ctx_train      = 8192
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_embd           = 768
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_layer          = 12
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_head           = 12
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_head_kv        = 12
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_rot            = 64
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_swa            = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_head_k    = 64
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_head_v    = 64
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_gqa            = 1
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_k_gqa     = 768
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_v_gqa     = 768
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: f_norm_eps       = 1.0e-12
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: f_max_alibi_bias = 8.0e+00
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_ff             = 3072
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_expert         = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_expert_used    = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: causal attn      = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: pooling type     = 1
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: rope type        = -1
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: rope scaling     = linear
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: freq_base_train  = 10000.0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: freq_scale_train = 1
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_ctx_orig_yarn  = 8192
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: rope_finetuned   = unknown
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_conv       = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_inner      = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_state      = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: ssm_dt_rank      = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: ssm_dt_b_c_rms   = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: model type       = 137M
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: model ftype      = F16
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: model params     = 136.78 M
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: model size       = 261.05 MiB (16.01 BPW)
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: general.name     = jina-embeddings-v2-base-en
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: UNK token        = 100 '[UNK]'
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: SEP token        = 102 '[SEP]'
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: PAD token        = 0 '[PAD]'
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: CLS token        = 101 '[CLS]'
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: MASK token       = 103 '[MASK]'
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: LF token         = 0 '[PAD]'
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: max token length = 22
Okt 08 15:39:02 Aerion3 ollama[104243]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Okt 08 15:39:02 Aerion3 ollama[104243]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Okt 08 15:39:02 Aerion3 ollama[104243]: ggml_cuda_init: found 1 CUDA devices:
Okt 08 15:39:02 Aerion3 ollama[104243]:   Device 0: NVIDIA RTX A5000, compute capability 8.6, VMM: yes
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors: ggml ctx size =    0.16 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors: offloading 12 repeating layers to GPU
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors: offloading non-repeating layers to GPU
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors: offloaded 13/13 layers to GPU
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors:        CPU buffer size =    44.72 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors:      CUDA0 buffer size =   216.32 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: n_ctx      = 2048
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: n_batch    = 512
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: n_ubatch   = 512
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: flash_attn = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: freq_base  = 10000.0
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: freq_scale = 1
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_kv_cache_init:      CUDA0 KV buffer size =    72.00 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: KV self size  =   72.00 MiB, K (f16):   36.00 MiB, V (f16):   36.00 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model:      CUDA0 compute buffer size =    20.00 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model:  CUDA_Host compute buffer size =     3.50 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: graph nodes  = 441
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: graph splits = 2
Okt 08 15:39:02 Aerion3 ollama[699549]: INFO [main] model loaded | tid="136898111410176" timestamp=1728394742
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.721+02:00 level=INFO source=server.go:626 msg="llama runner started in 0.25 seconds"
Okt 08 15:39:02 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:39:02 | 200 |  654.483323ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.498+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-17536158-810f-656a-7572-2efa837e5d0c library=cuda total="23.7 GiB" available="22.9 GiB"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.498+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-870f99fd-7d9e-432f-685d-4f02a7144fee library=cuda total="23.7 GiB" available="23.4 GiB"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.499+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d gpu=GPU-870f99fd-7d9e-432f-685d-4f02a7144fee parallel=1 available=25119686656 required="1.1 GiB"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.499+02:00 level=INFO source=server.go:103 msg="system memory" total="251.5 GiB" free="219.6 GiB" free_swap="2.0 GiB"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.499+02:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=25 layers.offload=25 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="1.1 GiB" memory.required.partial="1.1 GiB" memory.required.kv="3.0 MiB" memory.required.allocations="[1.1 GiB]" memory.weights.total="580.2 MiB" memory.weights.repeating="520.6 MiB" memory.weights.nonrepeating="59.6 MiB" memory.graph.full="8.0 MiB" memory.graph.partial="8.0 MiB"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.502+02:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama1067081739/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d --ctx-size 512 --batch-size 512 --embedding --log-disable --n-gpu-layers 25 --parallel 1 --port 48315"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.502+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=2
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.502+02:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.503+02:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
Okt 08 15:39:03 Aerion3 ollama[699621]: INFO [main] build info | build=10 commit="188ed47" tid="127626911346688" timestamp=1728394743
Okt 08 15:39:03 Aerion3 ollama[699621]: INFO [main] system info | n_threads=24 n_threads_batch=24 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="127626911346688" timestamp=1728394743 total_threads=48
Okt 08 15:39:03 Aerion3 ollama[699621]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="47" port="48315" tid="127626911346688" timestamp=1728394743
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: loaded meta data with 23 key-value pairs and 389 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d (version GGUF V3 (latest))
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   0:                       general.architecture str              = bert
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   1:                               general.name str              = mxbai-embed-large-v1
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   2:                           bert.block_count u32              = 24
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   3:                        bert.context_length u32              = 512
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   4:                      bert.embedding_length u32              = 1024
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 4096
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 16
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f32              = 0.000000
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   8:                          general.file_type u32              = 1
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   9:                      bert.attention.causal bool             = false
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  10:                          bert.pooling_type u32              = 2
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 101
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 102
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = bert
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 100
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  19:          tokenizer.ggml.seperator_token_id u32              = 102
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  21:                tokenizer.ggml.cls_token_id u32              = 101
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  22:               tokenizer.ggml.mask_token_id u32              = 103
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - type  f32:  243 tensors
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - type  f16:  146 tensors
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_vocab: special tokens cache size = 5
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_vocab: token to piece cache size = 0.2032 MB
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: format           = GGUF V3 (latest)
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: arch             = bert
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: vocab type       = WPM
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_vocab          = 30522
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_merges         = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: vocab_only       = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_ctx_train      = 512
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_embd           = 1024
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_layer          = 24
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_head           = 16
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_head_kv        = 16
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_rot            = 64
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_swa            = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_head_k    = 64
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_head_v    = 64
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_gqa            = 1
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_k_gqa     = 1024
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_v_gqa     = 1024
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: f_norm_eps       = 1.0e-12
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_ff             = 4096
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_expert         = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_expert_used    = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: causal attn      = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: pooling type     = 2
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: rope type        = 2
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: rope scaling     = linear
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: freq_base_train  = 10000.0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: freq_scale_train = 1
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_ctx_orig_yarn  = 512
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: rope_finetuned   = unknown
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_conv       = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_inner      = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_state      = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: ssm_dt_rank      = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: ssm_dt_b_c_rms   = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: model type       = 335M
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: model ftype      = F16
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: model params     = 334.09 M
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: model size       = 637.85 MiB (16.02 BPW)
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: general.name     = mxbai-embed-large-v1
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: BOS token        = 101 '[CLS]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: EOS token        = 102 '[SEP]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: UNK token        = 100 '[UNK]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: SEP token        = 102 '[SEP]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: PAD token        = 0 '[PAD]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: CLS token        = 101 '[CLS]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: MASK token       = 103 '[MASK]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: LF token         = 0 '[PAD]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: max token length = 21
Okt 08 15:39:03 Aerion3 ollama[104243]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Okt 08 15:39:03 Aerion3 ollama[104243]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Okt 08 15:39:03 Aerion3 ollama[104243]: ggml_cuda_init: found 1 CUDA devices:
Okt 08 15:39:03 Aerion3 ollama[104243]:   Device 0: NVIDIA RTX A5000, compute capability 8.6, VMM: yes
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors: ggml ctx size =    0.32 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors: offloading 24 repeating layers to GPU
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors: offloading non-repeating layers to GPU
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors: offloaded 25/25 layers to GPU
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors:        CPU buffer size =    60.62 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors:      CUDA0 buffer size =   577.23 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: n_ctx      = 512
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: n_batch    = 512
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: n_ubatch   = 512
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: flash_attn = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: freq_base  = 10000.0
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: freq_scale = 1
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_kv_cache_init:      CUDA0 KV buffer size =    48.00 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: KV self size  =   48.00 MiB, K (f16):   24.00 MiB, V (f16):   24.00 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model:      CUDA0 compute buffer size =    25.01 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model:  CUDA_Host compute buffer size =     5.01 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: graph nodes  = 849
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: graph splits = 2
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.754+02:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
Okt 08 15:39:03 Aerion3 ollama[699621]: INFO [main] model loaded | tid="127626911346688" timestamp=1728394743
Okt 08 15:39:04 Aerion3 ollama[104243]: time=2024-10-08T15:39:04.005+02:00 level=INFO source=server.go:626 msg="llama runner started in 0.50 seconds"
Okt 08 15:39:04 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:39:04 | 200 |  779.844982ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:39:57 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:39:57 | 200 |  147.365562ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:39:57 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:39:57 | 200 |  163.893891ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:41:31 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:41:31 | 200 |  159.224843ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:41:31 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:41:31 | 200 |   163.23205ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:42:22 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:42:22 | 200 |  153.931289ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:42:23 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:42:23 | 200 |  165.401049ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:47:27 Aerion3 ollama[104243]: time=2024-10-08T15:47:27.875+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.167041341 model=/usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3
Okt 08 15:47:28 Aerion3 ollama[104243]: time=2024-10-08T15:47:28.125+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.417041382 model=/usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3
Okt 08 15:47:32 Aerion3 ollama[104243]: time=2024-10-08T15:47:32.960+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.08496612 model=/usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
Okt 08 15:47:33 Aerion3 ollama[104243]: time=2024-10-08T15:47:33.209+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.334297991 model=/usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
Okt 08 15:47:33 Aerion3 ollama[104243]: time=2024-10-08T15:47:33.459+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.584203757 model=/usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d

@maxruby commented on GitHub (Oct 8, 2024): @rick-github Thanks for the quick reply. Of course, here is a log for one session which shows the warning: ``` Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.465+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3 gpu=GPU-17536158-810f-656a-7572-2efa837e5d0c parallel=1 available=25198395392 required="754.6 MiB" Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.465+02:00 level=INFO source=server.go:103 msg="system memory" total="251.5 GiB" free="220.0 GiB" free_swap="2.0 GiB" Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.465+02:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=13 layers.offload=13 layers.split="" memory.available="[23.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="754.6 MiB" memory.required.partial="754.6 MiB" memory.required.kv="6.0 MiB" memory.required.allocations="[754.6 MiB]" memory.weights.total="222.3 MiB" memory.weights.repeating="177.6 MiB" memory.weights.nonrepeating="44.7 MiB" memory.graph.full="12.0 MiB" memory.graph.partial="12.0 MiB" Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.468+02:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama1067081739/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --parallel 1 --port 36121" Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.469+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1 Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.469+02:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding" Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.469+02:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error" Okt 08 15:39:02 Aerion3 ollama[699549]: INFO [main] build info | build=10 commit="188ed47" tid="136898111410176" timestamp=1728394742 Okt 08 15:39:02 Aerion3 ollama[699549]: INFO [main] system info | n_threads=24 n_threads_batch=24 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="136898111410176" timestamp=1728394742 total_threads=48 Okt 08 15:39:02 Aerion3 ollama[699549]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="47" port="36121" tid="136898111410176" timestamp=1728394742 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: loaded meta data with 24 key-value pairs and 196 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3 (version GGUF V3 (latest)) Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 0: general.architecture str = jina-bert-v2 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 1: general.name str = jina-embeddings-v2-base-en Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 2: jina-bert-v2.block_count u32 = 12 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 3: jina-bert-v2.context_length u32 = 8192 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 4: jina-bert-v2.embedding_length u32 = 768 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 5: jina-bert-v2.feed_forward_length u32 = 3072 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 6: jina-bert-v2.attention.head_count u32 = 12 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 7: jina-bert-v2.attention.layer_norm_epsilon f32 = 0.000000 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 8: general.file_type u32 = 1 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 9: jina-bert-v2.attention.causal bool = false Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 10: jina-bert-v2.pooling_type u32 = 1 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 11: tokenizer.ggml.token_type_count u32 = 2 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 12: tokenizer.ggml.model str = bert Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 13: tokenizer.ggml.pre str = jina-v2-en Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,30528] = ["[PAD]", "[unused0]", "[unused1]", "... Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,30528] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 16: tokenizer.ggml.unknown_token_id u32 = 100 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 17: tokenizer.ggml.seperator_token_id u32 = 102 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 0 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 19: tokenizer.ggml.cls_token_id u32 = 101 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 20: tokenizer.ggml.mask_token_id u32 = 103 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool = true Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 22: tokenizer.ggml.add_eos_token bool = true Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv 23: general.quantization_version u32 = 2 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - type f32: 111 tensors Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - type f16: 85 tensors Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_vocab: special tokens cache size = 5 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_vocab: token to piece cache size = 0.2032 MB Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: format = GGUF V3 (latest) Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: arch = jina-bert-v2 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: vocab type = WPM Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_vocab = 30528 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_merges = 0 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: vocab_only = 0 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_ctx_train = 8192 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_embd = 768 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_layer = 12 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_head = 12 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_head_kv = 12 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_rot = 64 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_swa = 0 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_head_k = 64 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_head_v = 64 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_gqa = 1 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_k_gqa = 768 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_v_gqa = 768 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: f_norm_eps = 1.0e-12 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: f_norm_rms_eps = 0.0e+00 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: f_clamp_kqv = 0.0e+00 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: f_max_alibi_bias = 8.0e+00 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: f_logit_scale = 0.0e+00 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_ff = 3072 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_expert = 0 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_expert_used = 0 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: causal attn = 0 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: pooling type = 1 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: rope type = -1 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: rope scaling = linear Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: freq_base_train = 10000.0 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: freq_scale_train = 1 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_ctx_orig_yarn = 8192 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: rope_finetuned = unknown Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_conv = 0 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_inner = 0 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_state = 0 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: ssm_dt_rank = 0 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: ssm_dt_b_c_rms = 0 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: model type = 137M Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: model ftype = F16 Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: model params = 136.78 M Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: model size = 261.05 MiB (16.01 BPW) Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: general.name = jina-embeddings-v2-base-en Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: UNK token = 100 '[UNK]' Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: SEP token = 102 '[SEP]' Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: PAD token = 0 '[PAD]' Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: CLS token = 101 '[CLS]' Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: MASK token = 103 '[MASK]' Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: LF token = 0 '[PAD]' Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: max token length = 22 Okt 08 15:39:02 Aerion3 ollama[104243]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no Okt 08 15:39:02 Aerion3 ollama[104243]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no Okt 08 15:39:02 Aerion3 ollama[104243]: ggml_cuda_init: found 1 CUDA devices: Okt 08 15:39:02 Aerion3 ollama[104243]: Device 0: NVIDIA RTX A5000, compute capability 8.6, VMM: yes Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors: ggml ctx size = 0.16 MiB Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors: offloading 12 repeating layers to GPU Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors: offloading non-repeating layers to GPU Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors: offloaded 13/13 layers to GPU Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors: CPU buffer size = 44.72 MiB Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors: CUDA0 buffer size = 216.32 MiB Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: n_ctx = 2048 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: n_batch = 512 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: n_ubatch = 512 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: flash_attn = 0 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: freq_base = 10000.0 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: freq_scale = 1 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_kv_cache_init: CUDA0 KV buffer size = 72.00 MiB Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: KV self size = 72.00 MiB, K (f16): 36.00 MiB, V (f16): 36.00 MiB Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: CPU output buffer size = 0.00 MiB Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: CUDA0 compute buffer size = 20.00 MiB Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: CUDA_Host compute buffer size = 3.50 MiB Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: graph nodes = 441 Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: graph splits = 2 Okt 08 15:39:02 Aerion3 ollama[699549]: INFO [main] model loaded | tid="136898111410176" timestamp=1728394742 Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.721+02:00 level=INFO source=server.go:626 msg="llama runner started in 0.25 seconds" Okt 08 15:39:02 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:39:02 | 200 | 654.483323ms | 172.18.0.7 | POST "/api/embed" Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.498+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-17536158-810f-656a-7572-2efa837e5d0c library=cuda total="23.7 GiB" available="22.9 GiB" Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.498+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-870f99fd-7d9e-432f-685d-4f02a7144fee library=cuda total="23.7 GiB" available="23.4 GiB" Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.499+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d gpu=GPU-870f99fd-7d9e-432f-685d-4f02a7144fee parallel=1 available=25119686656 required="1.1 GiB" Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.499+02:00 level=INFO source=server.go:103 msg="system memory" total="251.5 GiB" free="219.6 GiB" free_swap="2.0 GiB" Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.499+02:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=25 layers.offload=25 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="1.1 GiB" memory.required.partial="1.1 GiB" memory.required.kv="3.0 MiB" memory.required.allocations="[1.1 GiB]" memory.weights.total="580.2 MiB" memory.weights.repeating="520.6 MiB" memory.weights.nonrepeating="59.6 MiB" memory.graph.full="8.0 MiB" memory.graph.partial="8.0 MiB" Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.502+02:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama1067081739/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d --ctx-size 512 --batch-size 512 --embedding --log-disable --n-gpu-layers 25 --parallel 1 --port 48315" Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.502+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=2 Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.502+02:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding" Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.503+02:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error" Okt 08 15:39:03 Aerion3 ollama[699621]: INFO [main] build info | build=10 commit="188ed47" tid="127626911346688" timestamp=1728394743 Okt 08 15:39:03 Aerion3 ollama[699621]: INFO [main] system info | n_threads=24 n_threads_batch=24 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="127626911346688" timestamp=1728394743 total_threads=48 Okt 08 15:39:03 Aerion3 ollama[699621]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="47" port="48315" tid="127626911346688" timestamp=1728394743 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: loaded meta data with 23 key-value pairs and 389 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d (version GGUF V3 (latest)) Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 0: general.architecture str = bert Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 1: general.name str = mxbai-embed-large-v1 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 2: bert.block_count u32 = 24 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 3: bert.context_length u32 = 512 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 4: bert.embedding_length u32 = 1024 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 5: bert.feed_forward_length u32 = 4096 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 6: bert.attention.head_count u32 = 16 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 7: bert.attention.layer_norm_epsilon f32 = 0.000000 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 8: general.file_type u32 = 1 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 9: bert.attention.causal bool = false Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 10: bert.pooling_type u32 = 2 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 11: tokenizer.ggml.token_type_count u32 = 2 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 12: tokenizer.ggml.bos_token_id u32 = 101 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 13: tokenizer.ggml.eos_token_id u32 = 102 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 14: tokenizer.ggml.model str = bert Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "... Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00... Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 100 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 19: tokenizer.ggml.seperator_token_id u32 = 102 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 0 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 21: tokenizer.ggml.cls_token_id u32 = 101 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv 22: tokenizer.ggml.mask_token_id u32 = 103 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - type f32: 243 tensors Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - type f16: 146 tensors Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_vocab: special tokens cache size = 5 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_vocab: token to piece cache size = 0.2032 MB Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: format = GGUF V3 (latest) Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: arch = bert Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: vocab type = WPM Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_vocab = 30522 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_merges = 0 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: vocab_only = 0 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_ctx_train = 512 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_embd = 1024 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_layer = 24 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_head = 16 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_head_kv = 16 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_rot = 64 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_swa = 0 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_head_k = 64 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_head_v = 64 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_gqa = 1 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_k_gqa = 1024 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_v_gqa = 1024 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: f_norm_eps = 1.0e-12 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: f_norm_rms_eps = 0.0e+00 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: f_clamp_kqv = 0.0e+00 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: f_logit_scale = 0.0e+00 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_ff = 4096 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_expert = 0 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_expert_used = 0 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: causal attn = 0 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: pooling type = 2 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: rope type = 2 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: rope scaling = linear Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: freq_base_train = 10000.0 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: freq_scale_train = 1 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_ctx_orig_yarn = 512 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: rope_finetuned = unknown Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_conv = 0 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_inner = 0 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_state = 0 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: ssm_dt_rank = 0 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: ssm_dt_b_c_rms = 0 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: model type = 335M Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: model ftype = F16 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: model params = 334.09 M Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: model size = 637.85 MiB (16.02 BPW) Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: general.name = mxbai-embed-large-v1 Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: BOS token = 101 '[CLS]' Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: EOS token = 102 '[SEP]' Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: UNK token = 100 '[UNK]' Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: SEP token = 102 '[SEP]' Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: PAD token = 0 '[PAD]' Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: CLS token = 101 '[CLS]' Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: MASK token = 103 '[MASK]' Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: LF token = 0 '[PAD]' Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: max token length = 21 Okt 08 15:39:03 Aerion3 ollama[104243]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no Okt 08 15:39:03 Aerion3 ollama[104243]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no Okt 08 15:39:03 Aerion3 ollama[104243]: ggml_cuda_init: found 1 CUDA devices: Okt 08 15:39:03 Aerion3 ollama[104243]: Device 0: NVIDIA RTX A5000, compute capability 8.6, VMM: yes Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors: ggml ctx size = 0.32 MiB Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors: offloading 24 repeating layers to GPU Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors: offloading non-repeating layers to GPU Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors: offloaded 25/25 layers to GPU Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors: CPU buffer size = 60.62 MiB Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors: CUDA0 buffer size = 577.23 MiB Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: n_ctx = 512 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: n_batch = 512 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: n_ubatch = 512 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: flash_attn = 0 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: freq_base = 10000.0 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: freq_scale = 1 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_kv_cache_init: CUDA0 KV buffer size = 48.00 MiB Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: KV self size = 48.00 MiB, K (f16): 24.00 MiB, V (f16): 24.00 MiB Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: CPU output buffer size = 0.00 MiB Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: CUDA0 compute buffer size = 25.01 MiB Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: CUDA_Host compute buffer size = 5.01 MiB Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: graph nodes = 849 Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: graph splits = 2 Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.754+02:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model" Okt 08 15:39:03 Aerion3 ollama[699621]: INFO [main] model loaded | tid="127626911346688" timestamp=1728394743 Okt 08 15:39:04 Aerion3 ollama[104243]: time=2024-10-08T15:39:04.005+02:00 level=INFO source=server.go:626 msg="llama runner started in 0.50 seconds" Okt 08 15:39:04 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:39:04 | 200 | 779.844982ms | 172.18.0.7 | POST "/api/embed" Okt 08 15:39:57 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:39:57 | 200 | 147.365562ms | 172.18.0.7 | POST "/api/embed" Okt 08 15:39:57 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:39:57 | 200 | 163.893891ms | 172.18.0.7 | POST "/api/embed" Okt 08 15:41:31 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:41:31 | 200 | 159.224843ms | 172.18.0.7 | POST "/api/embed" Okt 08 15:41:31 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:41:31 | 200 | 163.23205ms | 172.18.0.7 | POST "/api/embed" Okt 08 15:42:22 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:42:22 | 200 | 153.931289ms | 172.18.0.7 | POST "/api/embed" Okt 08 15:42:23 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:42:23 | 200 | 165.401049ms | 172.18.0.7 | POST "/api/embed" Okt 08 15:47:27 Aerion3 ollama[104243]: time=2024-10-08T15:47:27.875+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.167041341 model=/usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3 Okt 08 15:47:28 Aerion3 ollama[104243]: time=2024-10-08T15:47:28.125+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.417041382 model=/usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3 Okt 08 15:47:32 Aerion3 ollama[104243]: time=2024-10-08T15:47:32.960+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.08496612 model=/usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d Okt 08 15:47:33 Aerion3 ollama[104243]: time=2024-10-08T15:47:33.209+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.334297991 model=/usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d Okt 08 15:47:33 Aerion3 ollama[104243]: time=2024-10-08T15:47:33.459+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.584203757 model=/usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d ```

GiteaMirror commented

2026-04-22 09:49:52 -05:00

@rick-github commented on GitHub (Oct 8, 2024):

I haven't completely figured this out yet, but it looks like a harmless race condition. ollama creates a go routine to monitor the recovery of VRAM usage, but by the time the routine runs, the model is already unloaded because it's so small. The routine misses this and spends another 5 seconds waiting for the VRAM usage to decrease before giving up with the didn't recover message.

@rick-github commented on GitHub (Oct 8, 2024): I haven't completely figured this out yet, but it looks like a harmless race condition. ollama creates a go routine to monitor the recovery of VRAM usage, but by the time the routine runs, the model is already unloaded because it's so small. The routine misses this and spends another 5 seconds waiting for the VRAM usage to decrease before giving up with the `didn't recover` message.

GiteaMirror commented

2026-04-22 09:49:52 -05:00

@maxruby commented on GitHub (Oct 8, 2024):

I haven't completely figured this out yet, but it looks like a harmless race condition. ollama creates a go routine to monitor the recovery of VRAM usage, but by the time the routine runs, the model is already unloaded because it's so small. The routine misses this and spends another 5 seconds waiting for the VRAM usage to decrease before giving up with the didn't recover message.

Thanks for the work so far. This would seem to be consistent with the logs. I guess this would mean perhaps https://github.com/ollama/ollama/blob/main/server/sched.go needs a fix.

I suppose a potential race condition involving VRAM monitoring could be associated with sched.go since it happens during model unloading, where the goroutine monitoring VRAM waits for memory recovery, potentially missing the model unload event. Might this cause an unnecessary delay?

Recently, there appears to be this change, not sure if this is directly related to the issue here: 90ca84172c (diff-aa405274e6abb3e80a9097185ed34bbf8c98cf3410448ab10cac320f4a7aca4aR197)

@maxruby commented on GitHub (Oct 8, 2024): > I haven't completely figured this out yet, but it looks like a harmless race condition. ollama creates a go routine to monitor the recovery of VRAM usage, but by the time the routine runs, the model is already unloaded because it's so small. The routine misses this and spends another 5 seconds waiting for the VRAM usage to decrease before giving up with the `didn't recover` message. Thanks for the work so far. This would seem to be consistent with the logs. I guess this would mean perhaps https://github.com/ollama/ollama/blob/main/server/sched.go needs a fix. I suppose a potential race condition involving VRAM monitoring could be associated with sched.go since it happens during model unloading, where the goroutine monitoring VRAM waits for memory recovery, potentially missing the model unload event. Might this cause an unnecessary delay? Recently, there appears to be this change, not sure if this is directly related to the issue here: https://github.com/ollama/ollama/commit/90ca84172c2a98ecfd76eb7e05cd3e33e1dde507#diff-aa405274e6abb3e80a9097185ed34bbf8c98cf3410448ab10cac320f4a7aca4aR197

GiteaMirror commented

2026-04-22 09:49:53 -05:00

@dhiltgen commented on GitHub (Oct 8, 2024):

Running with OLLAMA_DEBUG=1 might yield a bit more details on where the algorithm is getting confused.

The fact that 2 or more logs are reporting gpu VRAM usage didn't recover... for each model shouldn't happen. Somehow there appear to be multiple unloads in flight for a single model and that shouldn't happen, which is likely why the VRAM detection logic is getting thrown off.

@dhiltgen commented on GitHub (Oct 8, 2024): Running with OLLAMA_DEBUG=1 might yield a bit more details on where the algorithm is getting confused. The fact that 2 or more logs are reporting `gpu VRAM usage didn't recover...` for each model shouldn't happen. Somehow there appear to be multiple unloads in flight for a single model and that shouldn't happen, which is likely why the VRAM detection logic is getting thrown off.

GiteaMirror commented

2026-04-22 09:49:54 -05:00

@rick-github commented on GitHub (Oct 8, 2024):

I was able to duplicate this with one model.

ollama-1  | time=2024-10-08T20:16:00.487Z level=DEBUG source=sched.go:466 msg="context for request finished"
ollama-1  | time=2024-10-08T20:16:00.487Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d duration=1m0s
ollama-1  | time=2024-10-08T20:16:00.487Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d refCount=0
ollama-1  | [GIN] 2024/10/08 - 20:16:00 | 200 |  704.668383ms |      172.21.0.1 | POST     "/api/embed"
ollama-1  | [GIN] 2024/10/08 - 20:16:09 | 200 |      16.724µs |             ::1 | GET      "/"
ollama-1  | [GIN] 2024/10/08 - 20:16:19 | 200 |      29.061µs |             ::1 | GET      "/"
ollama-1  | [GIN] 2024/10/08 - 20:16:29 | 200 |      17.811µs |             ::1 | GET      "/"
ollama-1  | [GIN] 2024/10/08 - 20:16:39 | 200 |      18.526µs |      172.21.0.3 | HEAD     "/"
ollama-1  | [GIN] 2024/10/08 - 20:16:39 | 200 |      88.317µs |      172.21.0.3 | GET      "/api/ps"
ollama-1  | [GIN] 2024/10/08 - 20:16:39 | 200 |      18.191µs |             ::1 | GET      "/"
ollama-1  | [GIN] 2024/10/08 - 20:16:49 | 200 |      21.965µs |             ::1 | GET      "/"
ollama-1  | [GIN] 2024/10/08 - 20:16:59 | 200 |      17.751µs |             ::1 | GET      "/"
ollama-1  | time=2024-10-08T20:17:00.488Z level=DEBUG source=sched.go:341 msg="timer expired, expiring to unload" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama-1  | time=2024-10-08T20:17:00.488Z level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama-1  | time=2024-10-08T20:17:00.488Z level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama-1  | time=2024-10-08T20:17:00.488Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="62.5 GiB" before.free="60.7 GiB" before.free_swap="130.0 GiB" now.total="62.5 GiB" now.free="60.4 GiB" now.free_swap="130.0 GiB"
ollama-1  | CUDA driver version: 12.2
ollama-1  | time=2024-10-08T20:17:00.547Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-6d37f28b-7662-6da1-7c11-2f02633112f2 name="NVIDIA GeForce RTX 3080 Laptop GPU" overhead="0 B" before.total="15.7 GiB" before.free="15.6 GiB" now.total="15.7 GiB" now.free="14.8 GiB" now.used="992.7 MiB"
ollama-1  | releasing cuda driver library
ollama-1  | time=2024-10-08T20:17:00.547Z level=DEBUG source=server.go:1044 msg="stopping llama server"
ollama-1  | time=2024-10-08T20:17:00.548Z level=DEBUG source=server.go:1050 msg="waiting for llama server to exit"
ollama-1  | time=2024-10-08T20:17:00.573Z level=DEBUG source=server.go:1054 msg="llama server stopped"
ollama-1  | time=2024-10-08T20:17:00.573Z level=DEBUG source=sched.go:380 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama-1  | time=2024-10-08T20:17:00.799Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="62.5 GiB" before.free="60.4 GiB" before.free_swap="130.0 GiB" now.total="62.5 GiB" now.free="60.7 GiB" now.free_swap="130.0 GiB"
ollama-1  | CUDA driver version: 12.2
ollama-1  | time=2024-10-08T20:17:00.847Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-6d37f28b-7662-6da1-7c11-2f02633112f2 name="NVIDIA GeForce RTX 3080 Laptop GPU" overhead="0 B" before.total="15.7 GiB" before.free="14.8 GiB" now.total="15.7 GiB" now.free="15.6 GiB" now.used="164.3 MiB"
ollama-1  | releasing cuda driver library
ollama-1  | time=2024-10-08T20:17:01.048Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="62.5 GiB" before.free="60.7 GiB" before.free_swap="130.0 GiB" now.total="62.5 GiB" now.free="60.7 GiB" now.free_swap="130.0 GiB"
ollama-1  | CUDA driver version: 12.2
ollama-1  | time=2024-10-08T20:17:01.100Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-6d37f28b-7662-6da1-7c11-2f02633112f2 name="NVIDIA GeForce RTX 3080 Laptop GPU" overhead="0 B" before.total="15.7 GiB" before.free="15.6 GiB" now.total="15.7 GiB" now.free="15.6 GiB" now.used="164.3 MiB"

At time=2024-10-08T20:17:00.547Z now.used="992.7 MiB", the runner is released, at time=2024-10-08T20:17:00.847Z now.used="164.3 MiB" yet ollama continues to wait for the VRAM usage to decrease.

@rick-github commented on GitHub (Oct 8, 2024): I was able to duplicate this with one model. ``` ollama-1 | time=2024-10-08T20:16:00.487Z level=DEBUG source=sched.go:466 msg="context for request finished" ollama-1 | time=2024-10-08T20:16:00.487Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d duration=1m0s ollama-1 | time=2024-10-08T20:16:00.487Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d refCount=0 ollama-1 | [GIN] 2024/10/08 - 20:16:00 | 200 | 704.668383ms | 172.21.0.1 | POST "/api/embed" ollama-1 | [GIN] 2024/10/08 - 20:16:09 | 200 | 16.724µs | ::1 | GET "/" ollama-1 | [GIN] 2024/10/08 - 20:16:19 | 200 | 29.061µs | ::1 | GET "/" ollama-1 | [GIN] 2024/10/08 - 20:16:29 | 200 | 17.811µs | ::1 | GET "/" ollama-1 | [GIN] 2024/10/08 - 20:16:39 | 200 | 18.526µs | 172.21.0.3 | HEAD "/" ollama-1 | [GIN] 2024/10/08 - 20:16:39 | 200 | 88.317µs | 172.21.0.3 | GET "/api/ps" ollama-1 | [GIN] 2024/10/08 - 20:16:39 | 200 | 18.191µs | ::1 | GET "/" ollama-1 | [GIN] 2024/10/08 - 20:16:49 | 200 | 21.965µs | ::1 | GET "/" ollama-1 | [GIN] 2024/10/08 - 20:16:59 | 200 | 17.751µs | ::1 | GET "/" ollama-1 | time=2024-10-08T20:17:00.488Z level=DEBUG source=sched.go:341 msg="timer expired, expiring to unload" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d ollama-1 | time=2024-10-08T20:17:00.488Z level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d ollama-1 | time=2024-10-08T20:17:00.488Z level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d ollama-1 | time=2024-10-08T20:17:00.488Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="62.5 GiB" before.free="60.7 GiB" before.free_swap="130.0 GiB" now.total="62.5 GiB" now.free="60.4 GiB" now.free_swap="130.0 GiB" ollama-1 | CUDA driver version: 12.2 ollama-1 | time=2024-10-08T20:17:00.547Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-6d37f28b-7662-6da1-7c11-2f02633112f2 name="NVIDIA GeForce RTX 3080 Laptop GPU" overhead="0 B" before.total="15.7 GiB" before.free="15.6 GiB" now.total="15.7 GiB" now.free="14.8 GiB" now.used="992.7 MiB" ollama-1 | releasing cuda driver library ollama-1 | time=2024-10-08T20:17:00.547Z level=DEBUG source=server.go:1044 msg="stopping llama server" ollama-1 | time=2024-10-08T20:17:00.548Z level=DEBUG source=server.go:1050 msg="waiting for llama server to exit" ollama-1 | time=2024-10-08T20:17:00.573Z level=DEBUG source=server.go:1054 msg="llama server stopped" ollama-1 | time=2024-10-08T20:17:00.573Z level=DEBUG source=sched.go:380 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d ollama-1 | time=2024-10-08T20:17:00.799Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="62.5 GiB" before.free="60.4 GiB" before.free_swap="130.0 GiB" now.total="62.5 GiB" now.free="60.7 GiB" now.free_swap="130.0 GiB" ollama-1 | CUDA driver version: 12.2 ollama-1 | time=2024-10-08T20:17:00.847Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-6d37f28b-7662-6da1-7c11-2f02633112f2 name="NVIDIA GeForce RTX 3080 Laptop GPU" overhead="0 B" before.total="15.7 GiB" before.free="14.8 GiB" now.total="15.7 GiB" now.free="15.6 GiB" now.used="164.3 MiB" ollama-1 | releasing cuda driver library ollama-1 | time=2024-10-08T20:17:01.048Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="62.5 GiB" before.free="60.7 GiB" before.free_swap="130.0 GiB" now.total="62.5 GiB" now.free="60.7 GiB" now.free_swap="130.0 GiB" ollama-1 | CUDA driver version: 12.2 ollama-1 | time=2024-10-08T20:17:01.100Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-6d37f28b-7662-6da1-7c11-2f02633112f2 name="NVIDIA GeForce RTX 3080 Laptop GPU" overhead="0 B" before.total="15.7 GiB" before.free="15.6 GiB" now.total="15.7 GiB" now.free="15.6 GiB" now.used="164.3 MiB" ``` At time=2024-10-08T20:17:00.547Z now.used="992.7 MiB", the runner is released, at time=2024-10-08T20:17:00.847Z now.used="164.3 MiB" yet ollama continues to wait for the VRAM usage to decrease.

GiteaMirror commented

2026-04-22 09:49:55 -05:00

@dhiltgen commented on GitHub (Oct 8, 2024):

Hmm... @rick-github did you see logs saying "gpu VRAM usage didn't recover" as well that you didn't paste in the log above? I would have expected a "gpu VRAM free memory converged ..." log output though. The scheduler takes a snapshot of memory before starting the actual unload of the process, then checks again for the memory to drop, and your logs seem to show that happening, but I can't tell what the model size was.

At 2024-10-08T20:17:00.547Z the GPU has now.total="15.7 GiB" now.free="14.8 GiB" now.used="992.7 MiB"
Then the subprocess is terminated
At 2024-10-08T20:17:01.100Z the GPU has now.total="15.7 GiB" now.free="15.6 GiB" now.used="164.3 MiB

@dhiltgen commented on GitHub (Oct 8, 2024): Hmm... @rick-github did you see logs saying "gpu VRAM usage didn't recover" as well that you didn't paste in the log above? I would have expected a "gpu VRAM free memory converged ..." log output though. The scheduler takes a snapshot of memory before starting the actual unload of the process, then checks again for the memory to drop, and your logs seem to show that happening, but I can't tell what the model size was. - At `2024-10-08T20:17:00.547Z` the GPU has `now.total="15.7 GiB" now.free="14.8 GiB" now.used="992.7 MiB"` - Then the subprocess is terminated - At `2024-10-08T20:17:01.100Z` the GPU has `now.total="15.7 GiB" now.free="15.6 GiB" now.used="164.3 MiB`

GiteaMirror commented

2026-04-22 09:49:56 -05:00

@rick-github commented on GitHub (Oct 8, 2024):

ollama-1  | time=2024-10-08T20:17:05.548Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.059729676 model=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d

The model was mxbai-embed-large:latest.

@rick-github commented on GitHub (Oct 8, 2024): ``` ollama-1 | time=2024-10-08T20:17:05.548Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.059729676 model=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d ``` The model was mxbai-embed-large:latest.

GiteaMirror commented

2026-04-22 09:49:57 -05:00

@rick-github commented on GitHub (Oct 8, 2024):

FYI, the above experiment was run on a 3080. I was unable to duplicate on a 4070:

ollama  | time=2024-10-08T21:18:35.554Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d duration=30s
ollama  | time=2024-10-08T21:18:35.554Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d refCount=0
ollama  | [GIN] 2024/10/08 - 21:18:43 | 200 |       24.74µs |       127.0.0.1 | GET      "/"
ollama  | [GIN] 2024/10/08 - 21:18:53 | 200 |      16.985µs |       127.0.0.1 | GET      "/"
ollama  | [GIN] 2024/10/08 - 21:19:03 | 200 |       20.75µs |       127.0.0.1 | GET      "/"
ollama  | time=2024-10-08T21:19:05.555Z level=DEBUG source=sched.go:341 msg="timer expired, expiring to unload" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama  | time=2024-10-08T21:19:05.555Z level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama  | time=2024-10-08T21:19:05.555Z level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama  | time=2024-10-08T21:19:05.555Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="94.0 GiB" before.free="86.7 GiB" before.free_swap="245.8 GiB" now.total="94.0 GiB" now.free="85.8 GiB" now.free_swap="245.8 GiB"
ollama  | CUDA driver version: 12.0
ollama  | time=2024-10-08T21:19:05.611Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-b5d7e56c-4491-8eeb-cb2d-e8d8424e5bb7 name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.7 GiB" before.free="10.5 GiB" now.total="11.7 GiB" now.free="9.4 GiB" now.used="2.4 GiB"
ollama  | releasing cuda driver library
ollama  | time=2024-10-08T21:19:05.611Z level=DEBUG source=server.go:1044 msg="stopping llama server"
ollama  | time=2024-10-08T21:19:05.611Z level=DEBUG source=server.go:1050 msg="waiting for llama server to exit"
ollama  | time=2024-10-08T21:19:05.678Z level=DEBUG source=server.go:1054 msg="llama server stopped"
ollama  | time=2024-10-08T21:19:05.678Z level=DEBUG source=sched.go:380 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama  | time=2024-10-08T21:19:05.862Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="94.0 GiB" before.free="85.8 GiB" before.free_swap="245.8 GiB" now.total="94.0 GiB" now.free="86.7 GiB" now.free_swap="245.8 GiB"
ollama  | CUDA driver version: 12.0
ollama  | time=2024-10-08T21:19:05.919Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-b5d7e56c-4491-8eeb-cb2d-e8d8424e5bb7 name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.7 GiB" before.free="9.4 GiB" now.total="11.7 GiB" now.free="10.5 GiB" now.used="1.2 GiB"
ollama  | releasing cuda driver library
ollama  | time=2024-10-08T21:19:05.919Z level=DEBUG source=sched.go:659 msg="gpu VRAM free memory converged after 0.36 seconds" model=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama  | time=2024-10-08T21:19:05.919Z level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama  | time=2024-10-08T21:19:05.919Z level=DEBUG source=sched.go:308 msg="ignoring unload event with no pending requests"

@rick-github commented on GitHub (Oct 8, 2024): FYI, the above experiment was run on a 3080. I was unable to duplicate on a 4070: ``` ollama | time=2024-10-08T21:18:35.554Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d duration=30s ollama | time=2024-10-08T21:18:35.554Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d refCount=0 ollama | [GIN] 2024/10/08 - 21:18:43 | 200 | 24.74µs | 127.0.0.1 | GET "/" ollama | [GIN] 2024/10/08 - 21:18:53 | 200 | 16.985µs | 127.0.0.1 | GET "/" ollama | [GIN] 2024/10/08 - 21:19:03 | 200 | 20.75µs | 127.0.0.1 | GET "/" ollama | time=2024-10-08T21:19:05.555Z level=DEBUG source=sched.go:341 msg="timer expired, expiring to unload" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d ollama | time=2024-10-08T21:19:05.555Z level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d ollama | time=2024-10-08T21:19:05.555Z level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d ollama | time=2024-10-08T21:19:05.555Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="94.0 GiB" before.free="86.7 GiB" before.free_swap="245.8 GiB" now.total="94.0 GiB" now.free="85.8 GiB" now.free_swap="245.8 GiB" ollama | CUDA driver version: 12.0 ollama | time=2024-10-08T21:19:05.611Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-b5d7e56c-4491-8eeb-cb2d-e8d8424e5bb7 name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.7 GiB" before.free="10.5 GiB" now.total="11.7 GiB" now.free="9.4 GiB" now.used="2.4 GiB" ollama | releasing cuda driver library ollama | time=2024-10-08T21:19:05.611Z level=DEBUG source=server.go:1044 msg="stopping llama server" ollama | time=2024-10-08T21:19:05.611Z level=DEBUG source=server.go:1050 msg="waiting for llama server to exit" ollama | time=2024-10-08T21:19:05.678Z level=DEBUG source=server.go:1054 msg="llama server stopped" ollama | time=2024-10-08T21:19:05.678Z level=DEBUG source=sched.go:380 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d ollama | time=2024-10-08T21:19:05.862Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="94.0 GiB" before.free="85.8 GiB" before.free_swap="245.8 GiB" now.total="94.0 GiB" now.free="86.7 GiB" now.free_swap="245.8 GiB" ollama | CUDA driver version: 12.0 ollama | time=2024-10-08T21:19:05.919Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-b5d7e56c-4491-8eeb-cb2d-e8d8424e5bb7 name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.7 GiB" before.free="9.4 GiB" now.total="11.7 GiB" now.free="10.5 GiB" now.used="1.2 GiB" ollama | releasing cuda driver library ollama | time=2024-10-08T21:19:05.919Z level=DEBUG source=sched.go:659 msg="gpu VRAM free memory converged after 0.36 seconds" model=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d ollama | time=2024-10-08T21:19:05.919Z level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d ollama | time=2024-10-08T21:19:05.919Z level=DEBUG source=sched.go:308 msg="ignoring unload event with no pending requests" ```

GiteaMirror commented

2026-04-22 09:49:58 -05:00

@dhiltgen commented on GitHub (Oct 8, 2024):

I think this is specific to this model and our prediction being too high...

% ollama ps                           
NAME                        ID              SIZE      PROCESSOR    UNTIL              
mxbai-embed-large:latest    468836162de7    1.2 GB    100% GPU     4 minutes from now

However:

% nvidia-smi
Tue Oct  8 22:32:17 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GT 1030         Off |   00000000:01:00.0 Off |                  N/A |
| 35%   36C    P0             N/A /   19W |     716MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     89203      C   ...unners/cuda_v12/ollama_llama_server        712MiB |
+-----------------------------------------------------------------------------------------+

We're aiming for ~80% recovery, but that would require ~960M so the unload isn't enough to satisfy the algo.

@dhiltgen commented on GitHub (Oct 8, 2024): I think this is specific to this model and our prediction being too high... ``` % ollama ps NAME ID SIZE PROCESSOR UNTIL mxbai-embed-large:latest 468836162de7 1.2 GB 100% GPU 4 minutes from now ``` However: ``` % nvidia-smi Tue Oct 8 22:32:17 2024 +-----------------------------------------------------------------------------------------+ | NVIDIA-SMI 560.28.03 Driver Version: 560.28.03 CUDA Version: 12.6 | |-----------------------------------------+------------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |=========================================+========================+======================| | 0 NVIDIA GeForce GT 1030 Off | 00000000:01:00.0 Off | N/A | | 35% 36C P0 N/A / 19W | 716MiB / 4096MiB | 0% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ +-----------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=========================================================================================| | 0 N/A N/A 89203 C ...unners/cuda_v12/ollama_llama_server 712MiB | +-----------------------------------------------------------------------------------------+ ``` We're aiming for ~80% recovery, but that would require ~960M so the unload isn't enough to satisfy the algo.

GiteaMirror commented

2026-04-22 09:50:00 -05:00

@maxruby commented on GitHub (Oct 27, 2024):

Sorry to ask again, but its 3 weeks now, Can I help in any way to support in this PR?

@maxruby commented on GitHub (Oct 27, 2024): Sorry to ask again, but its 3 weeks now, Can I help in any way to support in this PR?

GiteaMirror commented

2026-04-22 09:50:01 -05:00

@vtien commented on GitHub (Jan 16, 2025):

Also facing this issue - any updates?

@vtien commented on GitHub (Jan 16, 2025): Also facing this issue - any updates?

GiteaMirror commented

2026-04-22 09:50:04 -05:00

@claymore666 commented on GitHub (Mar 6, 2025):

same here. ollama 0.5.13 on windows // RTX 2070 Super. Happens especially when using larger embedding models (snowflake-arctic-embed2:[137m|latest]) Does not happen with nomic-embed-text. Also seems more likely to happen when processing larger files.

@claymore666 commented on GitHub (Mar 6, 2025): same here. ollama 0.5.13 on windows // RTX 2070 Super. Happens especially when using larger embedding models (snowflake-arctic-embed2:[137m|latest]) Does not happen with nomic-embed-text. Also seems more likely to happen when processing larger files.

GiteaMirror commented

2026-04-22 09:50:05 -05:00

@rick-github commented on GitHub (Mar 6, 2025):

It's just a warning, you can ignore it. If you are seeing issues with snowflake with 0.5.13, it's probably https://github.com/ollama/ollama/issues/9511.

@rick-github commented on GitHub (Mar 6, 2025): It's just a warning, you can ignore it. If you are seeing issues with snowflake with 0.5.13, it's probably https://github.com/ollama/ollama/issues/9511.

GiteaMirror commented

2026-04-22 09:50:06 -05:00

@tjwebb commented on GitHub (Mar 21, 2025):

I am also seeing this issue with gemma3 on an Nvidia L4.

It's not "just a warning" because something is happening to cause the model to be unloaded and all subsequent requests to hang until I reboot ollama.

log output is here: https://gist.github.com/tjwebb/77657b55f62d55c487124bdd887b72e6

@tjwebb commented on GitHub (Mar 21, 2025): I am also seeing this issue with gemma3 on an Nvidia L4. It's not "just a warning" because something is happening to cause the model to be unloaded and all subsequent requests to hang until I reboot ollama. log output is here: https://gist.github.com/tjwebb/77657b55f62d55c487124bdd887b72e6

GiteaMirror commented

2026-04-22 09:50:08 -05:00

@rick-github commented on GitHub (Mar 21, 2025):

[GIN] 2025/03/21 - 20:58:28 | 200 |  2.425022178s |      172.18.0.2 | POST     "/api/generate"
time=2025-03-21T21:03:34.123Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.147554459 model=/root/.ollama/models/blobs/sha256-728e7e4ac6e65cd68bf0d6c3ebf2e9944b19d3ad2da49ab53265457f6de1f02c

The last request was at 20:58:28, and your keep_alive is 5 minutes. So at around 21:03:28 the model is unloaded, which starts the VRAM usage monitoring. The monitor expects VRAM usage to recover within 5 seconds, so at around 21:03:33 (or 21:03:34.123 from the timestamp, some drift due to overhead and imprecise timestamps) the monitor alerts that GPU VRAM usage hasn't recovered.

If you can set OLLAMA_DEBUG=1 in the environment and post a log, there might be details that indicate what's going on.

@rick-github commented on GitHub (Mar 21, 2025): ``` [GIN] 2025/03/21 - 20:58:28 | 200 | 2.425022178s | 172.18.0.2 | POST "/api/generate" time=2025-03-21T21:03:34.123Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.147554459 model=/root/.ollama/models/blobs/sha256-728e7e4ac6e65cd68bf0d6c3ebf2e9944b19d3ad2da49ab53265457f6de1f02c ``` The last request was at 20:58:28, and your `keep_alive` is 5 minutes. So at around 21:03:28 the model is unloaded, which starts the VRAM usage monitoring. The monitor expects VRAM usage to recover within 5 seconds, so at around 21:03:33 (or 21:03:34.123 from the timestamp, some drift due to overhead and imprecise timestamps) the monitor alerts that GPU VRAM usage hasn't recovered. If you can set `OLLAMA_DEBUG=1` in the environment and post a log, there might be details that indicate what's going on.

GiteaMirror commented

2026-04-22 09:50:08 -05:00

@tjwebb commented on GitHub (Mar 21, 2025):

I set OLLAMA_LOAD_TIMEOUT="60m" and that seems to have no effect.

I will reproduce shortly and post the full log with dfebug on

@tjwebb commented on GitHub (Mar 21, 2025): I set `OLLAMA_LOAD_TIMEOUT="60m"` and that seems to have no effect. I will reproduce shortly and post the full log with dfebug on

GiteaMirror commented

2026-04-22 09:50:09 -05:00

@rick-github commented on GitHub (Mar 21, 2025):

OLLAMA_KEEP_ALIVE

@rick-github commented on GitHub (Mar 21, 2025): `OLLAMA_KEEP_ALIVE`

GiteaMirror commented

2026-04-22 09:50:10 -05:00

@tjwebb commented on GitHub (Mar 21, 2025):

So at around 21:03:28 the model is unloaded,

Problem is that I am still making requests, the model should not get unloaded at all. I'm making tons of requests. All my requests begin hanging right at this moment when it spits out the unload warning and it never recovers.

@tjwebb commented on GitHub (Mar 21, 2025): > So at around 21:03:28 the model is unloaded, Problem is that I am still making requests, the model should not get unloaded at all. I'm making tons of requests. All my requests begin hanging right at this moment when it spits out the unload warning and it never recovers.

GiteaMirror commented

2026-04-22 09:50:13 -05:00

@rick-github commented on GitHub (Mar 21, 2025):

It may be a queueing problem in the server, not a problem with the runner. I see that you have OLLAMA_NUM_PARALLEL=1, OLLAMA_MAX_QUEUE=1000 and the queries are chronologically close, so I assume your client is dumping a lot of queries into the ollama API and processing the results as they are returned. If the runner is going idle for 5 minutes and the server is unresponsive to the point where you have to restart it, that would indicate a problem in the server.

@rick-github commented on GitHub (Mar 21, 2025): It may be a queueing problem in the server, not a problem with the runner. I see that you have `OLLAMA_NUM_PARALLEL=1`, `OLLAMA_MAX_QUEUE=1000` and the queries are chronologically close, so I assume your client is dumping a lot of queries into the ollama API and processing the results as they are returned. If the runner is going idle for 5 minutes and the server is unresponsive to the point where you have to restart it, that would indicate a problem in the server.

GiteaMirror commented

2026-04-22 09:50:14 -05:00

@tjwebb commented on GitHub (Mar 21, 2025):

Yea I set those other flags while trying to solve this problem. It happens without them also.

I'm sending about 150 queries over the course of an hour. Is that a lot?

@tjwebb commented on GitHub (Mar 21, 2025): Yea I set those other flags while trying to solve this problem. It happens without them also. I'm sending about 150 queries over the course of an hour. Is that a lot?

GiteaMirror commented

2026-04-22 09:50:15 -05:00

@rick-github commented on GitHub (Mar 21, 2025):

I'm sending about 150 queries over the course of an hour. Is that a lot?

It shouldn't be, I do much more than that for weeks at a time without issues.

@rick-github commented on GitHub (Mar 21, 2025): > I'm sending about 150 queries over the course of an hour. Is that a lot? It shouldn't be, I do much more than that for weeks at a time without issues.

GiteaMirror commented

2026-04-22 09:50:17 -05:00

@tjwebb commented on GitHub (Mar 21, 2025):

It failed again, the debug logs are HUGE. I'm trying to find a way to upload a 500MB text file

@tjwebb commented on GitHub (Mar 21, 2025): It failed again, the debug logs are HUGE. I'm trying to find a way to upload a 500MB text file

GiteaMirror commented

2026-04-22 09:50:18 -05:00

@maxruby commented on GitHub (Mar 22, 2025):

It may be a queueing problem in the server, not a problem with the runner. I see that you have OLLAMA_NUM_PARALLEL=1, OLLAMA_MAX_QUEUE=1000 and the queries are chronologically close, so I assume your client is dumping a lot of queries into the ollama API and processing the results as they are returned. If the runner is going idle for 5 minutes and the server is unresponsive to the point where you have to restart it, that would indicate a problem in the server.

Since I am the one who first reported this issue in October 2024 and has been watching the progress since then, I must say I am not really understanding why the insistence on this being "just a warning" by @rick-github. Even if its "just a warning", its clear also that this is causing confusion and in some cases evidently clogging API responses using some Embedding models on ollama servers. In my experience so far, I have encountered the problem reported here on both my own private ollama server (M1 OSX ) and at work on an Ubuntu server with 2x NVIDIA RTX A5000s. The question is still why we don't empower others to contribute to offer potential solutions. I offered to help in the past (see https://github.com/ollama/ollama/issues/7130#issuecomment-2440082258) but there was no positive response to this.

@maxruby commented on GitHub (Mar 22, 2025): > It may be a queueing problem in the server, not a problem with the runner. I see that you have `OLLAMA_NUM_PARALLEL=1`, `OLLAMA_MAX_QUEUE=1000` and the queries are chronologically close, so I assume your client is dumping a lot of queries into the ollama API and processing the results as they are returned. If the runner is going idle for 5 minutes and the server is unresponsive to the point where you have to restart it, that would indicate a problem in the server. Since I am the one who first reported this issue in October 2024 and has been watching the progress since then, I must say I am not really understanding why the insistence on this being "just a warning" by @rick-github. Even if its "just a warning", its clear also that this is causing confusion and in some cases evidently clogging API responses using some Embedding models on ollama servers. In my experience so far, I have encountered the problem reported here on both my own private ollama server (M1 OSX ) and at work on an Ubuntu server with 2x NVIDIA RTX A5000s. The question is still why we don't empower others to contribute to offer potential solutions. I offered to help in the past (see https://github.com/ollama/ollama/issues/7130#issuecomment-2440082258) but there was no positive response to this.

GiteaMirror commented

2026-04-22 09:50:19 -05:00

@rick-github commented on GitHub (Mar 22, 2025):

I agree that it's confusing, so let me try to clarify.

First, let's identify why this warning is emitted in the first place. Freeing GPU memory is asynchronous. That is, after ollama has told the GPU to release the memory, there's a delay before that memory shows up in the free list. So when unloading a model, ollama monitors the amount of free memory to see when the newly freed memory becomes available. The monitor is started just before the unload, and runs until either the freed memory (at least 80% of it) shows up in the free list, or the monitor timeout (5s) expires. Note, "monitor" - this has no effect on the actual unloading/loading of models.

Now to the specific examples of this warning in this issue. In your original post, the warning was emitted while unloading the jina-embeddings-v2-base-en and mxbai-embed-large models. Daniel investigated and found that the problem is a disconnect between what ollama estimates the memory requirements are going to be, and the amount of memory actually consumed by the GPU backend (llama.cpp) and the driver. Notably, nvidia-smi reports that the actual memory used is 712MB, while ollama estimated 1200MB. (The disconnect between estimation and actual is a whole 'nuther matter, see eg #6160). So the VRAM monitor is expecting at least (1200 * 80%) = 960MB, but the actual amount returned was 712MB. So the test condition in the VRAM monitor is not satisfied and it runs until the timeout.

The PR that Daniel proposed only changed the test condition - there was no change to how the models are loaded/unloaded. The warning in this case is bogus - it does not represent an actual problem (other than the confusion it generates for people who see it in their logs).

The second and third reports I admittedly gave short shrift, based on the results of the first investigation, the implication of small models, and the fact that there were real problems with snowflake at the time. If @vtien and @claymore666 are still having problems, I invite them to add logs and the problem can be diagnosed.

The fourth report from Travis was accompanied by logs which indicated the model involved was not a small one (ie, not a result of the actual/estimated logic failure), so I was able to put together a timeline, a possible cause, and a request for more detailed information. The investigation is on-going but so far it looks like the problem has nothing to do with VRAM usage.

Nobody is dis-empowering others from contributing potential solutions. Everybody is free to clone the repo, do their own investigation, and submit a PR. If that's not in their skill set, then a comprehensive bug report, preferably with logs and a nice to have code sample that demonstrates the problem will be gratefully received and given attention. It should be noted that there only so many man-hours available so some issues are prioritized over others, but I assure you that all issues and solutions are considered.

@rick-github commented on GitHub (Mar 22, 2025): I agree that it's confusing, so let me try to clarify. First, let's identify why this warning is emitted in the first place. Freeing GPU memory is asynchronous. That is, after ollama has told the GPU to release the memory, there's a delay before that memory shows up in the free list. So when unloading a model, ollama monitors the amount of free memory to see when the newly freed memory becomes available. The monitor is started just before the unload, and runs until either the freed memory (at least 80% of it) shows up in the free list, or the monitor timeout (5s) expires. Note, "monitor" - this has no effect on the actual unloading/loading of models. Now to the specific examples of this warning in this issue. In your original post, the warning was emitted while unloading the jina-embeddings-v2-base-en and mxbai-embed-large models. Daniel investigated and found that the problem is a disconnect between what ollama estimates the memory requirements are going to be, and the amount of memory actually consumed by the GPU backend (llama.cpp) and the driver. Notably, nvidia-smi reports that the actual memory used is 712MB, while ollama estimated 1200MB. (The disconnect between estimation and actual is a whole 'nuther matter, see eg #6160). So the VRAM monitor is expecting at least (1200 * 80%) = 960MB, but the actual amount returned was 712MB. So the test condition in the VRAM monitor is not satisfied and it runs until the timeout. The PR that Daniel proposed only changed the test condition - there was no change to how the models are loaded/unloaded. The warning in this case is bogus - it does not represent an actual problem (other than the confusion it generates for people who see it in their logs). The second and third reports I admittedly gave short shrift, based on the results of the first investigation, the implication of small models, and the fact that there were real problems with snowflake at the time. If @vtien and @claymore666 are still having problems, I invite them to add logs and the problem can be diagnosed. The fourth report from Travis was accompanied by logs which indicated the model involved was not a small one (ie, not a result of the actual/estimated logic failure), so I was able to put together a timeline, a possible cause, and a request for more detailed information. The investigation is on-going but so far it looks like the problem has nothing to do with VRAM usage. Nobody is dis-empowering others from contributing potential solutions. Everybody is free to clone the repo, do their own investigation, and submit a PR. If that's not in their skill set, then a comprehensive bug report, preferably with logs and a nice to have code sample that demonstrates the problem will be gratefully received and given attention. It should be noted that there only so many man-hours available so some issues are prioritized over others, but I assure you that all issues and solutions are considered.

GiteaMirror commented

2026-04-22 09:50:20 -05:00

@rick-github commented on GitHub (Mar 22, 2025):

Returning to the specific issue of the warning in Travis' report, we see it is in fact another estimate/actual mismatch:

$ ollama run gemma3:12b-it-q8_0 ''
$ curl -s localhost:11434/api/ps | jq '.models[].size_vram / 1024 / 1024 | floor'
11617
$ nvidia-smi -x -q | yq -p=xml -o=json  | jq -r '.nvidia_smi_log.gpu.processes.process_info.used_memory'
7448 MiB

ollama has estimated 11G, actual usage is 7G. (11617 * 0.8) = 9293.6 which is more than the 7448 that is released when the model is unloaded.

gemma3 estimates are wildly inaccurate, which is only going to get worse for large windows with #9892. The ollama team are aware of this, so I'm hoping that the estimation logic will receive some attention.

@rick-github commented on GitHub (Mar 22, 2025): Returning to the specific issue of the warning in Travis' report, we see it is in fact another estimate/actual mismatch: ```console $ ollama run gemma3:12b-it-q8_0 '' $ curl -s localhost:11434/api/ps | jq '.models[].size_vram / 1024 / 1024 | floor' 11617 $ nvidia-smi -x -q | yq -p=xml -o=json | jq -r '.nvidia_smi_log.gpu.processes.process_info.used_memory' 7448 MiB ``` ollama has estimated 11G, actual usage is 7G. (11617 * 0.8) = 9293.6 which is more than the 7448 that is released when the model is unloaded. gemma3 estimates are wildly inaccurate, which is only going to get worse for large windows with #9892. The ollama team are aware of this, so I'm hoping that the estimation logic will receive some attention.

Sign in to join this conversation.

Branches Tags

main

parth-update-hermes-launch

parth-agent-system-prompt-cwd

hoyyeva/vscode-extension-docs-update

parth-gemma4-chat-template-renderer

parth-fix-claude-model-picker

parth-api-status-context-length

docs/vscode-extension-setup

hoyyeva/wire-up-context-length

hoyyeva/claude-code-context-doc

jmorganca/investigate-issue-17046

hoyyeva/hermes-docs

jmorganca/agent-loop-style

hoyyeva/openclaw

parth-agent-loop

hoyyeva/ollama-vscode-extension

brucemacd/cache-metrics

brucemacd/hermes-desktop

hoyyeva/docs-vscode

parth-input-style-experiment

brucemacd/docs-glm52

hoyyeva/poc-docs

Parth/mlx-launch-recommendations

parth-first-time-app-cli-experience

test/darwin-xcode-pin

improve-cloud-model-recommendations

hoyyeva/goose-docs

jmorganca/context-limit-fixes

hoyyeva/qwen-doc

hoyyeva/vscode-docs

jmorganca/remove-mlx-imagegen-code

parth-copilot-token-length-defaults

hoyyeva/poolside-windows

laguna-support

jmorganca/harden-markdown-rendering

laguna-renderer-parser

laguna-llamacpp

codex/make-integration-hidden-and-lunchable

brucemacd/omp-docs

pdevine/gguf-mtp-oldstyle

hoyyeva/migrate-pi

hoyyeva/anthropic-local-image-path

parth-launch-codex-app

hoyyeva/anthropic-reference-images-path

parth-anthropic-reference-images-path

brucemacd/download-before-remove

hoyyeva/editor-config-repair

parth-mlx-decode-checkpoints

parth/hide-claude-desktop-till-release

parth-add-claude-code-autoinstall

release_v0.22.0

pdevine/manifest-list

codex/fix-codex-model-metadata-warning

pdevine/addressable-manifest

brucemacd/launch-fetch-reccomended

jmorganca/llama-compat

launch-copilot-cli

release_v0.20.7

parth-auto-save-backup

parth-test

jmorganca/gemma4-audio-replacements

fix-manifest-digest-on-pull

hoyyeva/vscode-improve

brucemacd/install-server-wait

parth/update-claude-docs

brucemac/start-ap-install

pdevine/mlx-update

pdevine/qwen35_vision

drifkin/api-show-fallback

mintlify/image-generation-1773352582

hoyyeva/server-context-length-local-config

jmorganca/faster-reptition-penalties

jmorganca/convert-nemotron

parth-pi-thinking

pdevine/sampling-penalties

jmorganca/fix-create-quantization-memory

dongchen/resumable_transfer_fix

pdevine/sampling-cache-error

jessegross/mlx-usage

hoyyeva/openclaw-config

hoyyeva/app-html

pdevine/qwen3next

brucemacd/sign-sh-install

brucemacd/tui-update

brucemacd/usage-api

jmorganca/launch-empty

fix-app-dist-embed

mxyng/mlx-compile

mxyng/mlx-quant

mxyng/mlx-glm4.7

mxyng/mlx

brucemacd/simplify-model-picker

jmorganca/qwen3-concurrent

fix-glm-4.7-flash-mla-config

drifkin/qwen3-coder-opening-tag

brucemacd/usage-cli

fix-cuda12-fattn-shmem

ollama-imagegen-docs

parth/fix-multiline-inputs

brucemacd/config-docs

mxyng/model-files

mxyng/simple-execute

fix-imagegen-ollama-models

mxyng/async-upload

jmorganca/lazy-no-dtype-changes

imagegen-auto-detect-create

parth/decrease-concurrent-download-hf

fix-mlx-quantize-init

jmorganca/x-cleanup

usage

imagegen-readme

jmorganca/glm-image

mlx-gpu-cd

jmorganca/imagegen-modelfile

parth/agent-skills

parth/agent-allowlist

parth/signed-in-offline

parth/agents

parth/fix-context-chopping

improve-cloud-flow

parth/add-models-websearch

parth/prompt-renderer-mcp

jmorganca/native-settings

jmorganca/download-stream-hash

jmorganca/client2-rebased

brucemacd/oai-chat-req-multipart

jessegross/multi_chunk_reserve

grace/additional-omit-empty

grace/mistral-3-large

mxyng/tokenizer2

mxyng/tokenizer

jessegross/flash

hoyyeva/windows-nacked-app

mxyng/cleanup-attention

grace/deepseek-parser

hoyyeva/remember-unsent-prompt

parth/add-lfs-pointer-error-conversion

parth/olmo2-test2

hoyyeva/ollama-launchagent-plist

nicole/olmo-model

parth/olmo-test

mxyng/remove-embedded

parth/render-template

jmorganca/intellect-3

parth/remove-prealloc-linter

jmorganca/cmd-eval

nicole/nomic-embed-text-fix

mxyng/lint-2

hoyyeva/add-gemini-3-pro-preview

hoyyeva/load-model-list

mxyng/expand-path

mxyng/environ-2

hoyyeva/deeplink-json-encoding

parth/improve-tool-calling-tests

hoyyeva/conversation

hoyyeva/assistant-edit-response

hoyyeva/thinking

origin/brucemacd/invalid-char-i-err

parth/improve-tool-calling

jmorganca/required-omitempty

grace/qwen3-vl-tests

mxyng/iter-client

parth/docs-readme

nicole/embed-test

pdevine/integration-benchstat

parth/remove-generate-cmd

parth/add-toolcall-id

mxyng/server-tests

jmorganca/glm-4.6

jmorganca/gin-h-compat

drifkin/stable-tool-args

pdevine/qwen3-more-thinking

parth/add-websearch-client

nicole/websearch_local

jmorganca/qwen3-coder-updates

grace/deepseek-v3-migration-tests

mxyng/fix-create

jmorganca/cloud-errors

pdevine/parser-tidy

revert-12233-parth/simplify-entrypoints-runner

parth/enable-so-gpt-oss

brucemacd/qwen3vl

jmorganca/readme-simplify

parth/gpt-oss-structured-outputs

revert-12039-jmorganca/tools-braces

mxyng/embeddings

mxyng/gguf

mxyng/benchmark

mxyng/types-null

parth/move-parsing

mxyng/gemma2

jmorganca/docs

mxyng/16-bit

mxyng/create-stdin

pdevine/authorizedkeys

mxyng/quant

parth/opt-in-error-context-window

brucemacd/cache-models

brucemacd/runner-completion

jmorganca/llama-update-6

brucemacd/benchmark-list

brucemacd/partial-read-caps

parth/deepseek-r1-tools

mxyng/omit-array

parth/tool-prefix-temp

brucemacd/runner-test

jmorganca/qwen25vl

brucemacd/model-forward-test-ext

parth/python-function-parsing

jmorganca/cuda-compression-none

drifkin/num-parallel

drifkin/chat-truncation-fix

jmorganca/sync

parth/python-tools-calling

drifkin/array-head-count

brucemacd/create-no-loop

parth/server-enable-content-stream-with-tools

qwen25omni

mxyng/v3

brucemacd/ropeconfig

jmorganca/silence-tokenizer

parth/sample-so-test

parth/sampling-structured-outputs

brucemacd/doc-go-engine

parth/constrained-sampling-json

jmorganca/mistral-wip

brucemacd/mistral-small-convert

parth/sample-unmarshal-json-for-params

brucemacd/jomorganca/mistral

pdevine/bfloat16

jmorganca/mistral

brucemacd/mistral

pdevine/logging

parth/sample-correctness-fix

parth/sample-fix-sorting

jmorgan/sample-fix-sorting-extras

jmorganca/temp-0-images

brucemacd/parallel-embed-models

brucemacd/shim-grammar

jmorganca/fix-gguf-error

bmizerany/nameswork

jmorganca/faster-releases

bmizerany/validatenames

brucemacd/err-no-vocab

brucemacd/rope-config

brucemacd/err-hint

brucemacd/qwen2_5

brucemacd/logprobs

brucemacd/new_runner_graph_bench

progress-flicker

brucemacd/forward-test

brucemacd/go_qwen2

pdevine/gemma2

jmorganca/add-missing-symlink-eval

mxyng/next-debug

parth/set-context-size-openai

brucemacd/next-bpe-bench

brucemacd/next-bpe-test

brucemacd/new_runner_e2e

brucemacd/new_runner_qwen2

pdevine/convert-cohere2

brucemacd/convert-cli

parth/log-probs

mxyng/next-mlx

mxyng/cmd-history

parth/templating

parth/tokenize-detokenize

brucemacd/check-key-register

bmizerany/grammar

jmorganca/vendor-081b29bd

mxyng/func-checks

jmorganca/fix-null-format

parth/fix-default-to-warn-json

jmorganca/qwen2vl

jmorganca/no-concat

parth/cmd-cleanup-SO

brucemacd/check-key-register-structured-err

parth/openai-stream-usage

parth/fix-referencing-so

stream-tools-stop

jmorganca/degin-1

brucemacd/install-path-clean

brucemacd/push-name-validation

brucemacd/browser-key-register

jmorganca/openai-fix-first-message

jmorganca/fix-proxy

jessegross/sample

parth/disallow-streaming-tools

dhiltgen/remove_submodule

jmorganca/ga

jmorganca/mllama

pdevine/newlines

pdevine/geems-2b

jmorganca/llama-bump

mxyng/modelname-7

mxyng/gin-slog

mxyng/modelname-6

jyan/convert-prog

jyan/quant5

paligemma-support

pdevine/import-docs

jmorganca/openai-context

jyan/paligemma

jyan/p2

jyan/palitest

bmizerany/embedspeedup

jmorganca/llama-vit

brucemacd/allow-ollama

royh/ep-methods

royh/whisper

mxyng/api-models

mxyng/fix-memory

jyan/q4_4/8

jyan/ollama-v

royh/stream-tools

roy-embed-parallel

bmizerany/hrm

revert-5963-revert-5924-mxyng/llama3.1-rope

royh/embed-viz

jyan/local2

jyan/auth

jyan/local

jyan/parse-temp

jmorganca/template-mistral

jyan/reord-g

royh-openai-suffixdocs

royh-imgembed

royh-embed-parallel

jyan/quant4

royh-precision

jyan/progress

pdevine/fix-template

jyan/quant3

pdevine/ggla

mxyng/update-registry-domain

jmorganca/ggml-static

mxyng/create-context

jyan/v0.146

mxyng/layers-from-files

build_dist

bmizerany/noseek

royh-ls

royh-name

timeout

mxyng/server-timestamp

bmizerany/nosillyggufslurps

royh-params

jmorganca/llama-cpp-7c26775

royh-openai-delete

royh-show-rigid

jmorganca/enable-fa

jmorganca/no-error-template

jyan/format

royh-testdelete

bmizerany/fastverify

language_support

pdevine/ps-glitches

brucemacd/tokenize

bruce/iq-quants

bmizerany/filepathwithcoloninhost

mxyng/split-bin

bmizerany/client-registry

jmorganca/if-none-match

native

jmorganca/native

jmorganca/batch-embeddings

jmorganca/initcmake

jmorganca/mm

pdevine/showggmlinfo

modenameenforcealphanum

bmizerany/modenameenforcealphanum

jmorganca/done-reason

jmorganca/llama-cpp-8960fe8

ollama.com

bmizerany/filepathnobuild

bmizerany/types/model/defaultfix

rmdisplaylong

nogogen

bmizerany/x

modelfile-readme

bmizerany/replacecolon

jmorganca/limit

jmorganca/execstack

jmorganca/replace-assets

mxyng/tune-concurrency

jmorganca/testing

whitespace-detection

jmorganca/options

upgrade-all

scratch

cuda-search

mattw/airenamer

mattw/allmodelsonhuggingface

mattw/quantcontext

mattw/whatneedstorun

brucemacd/llama-mem-calc

mattw/faq-context

mattw/communitylinks

mattw/noprune

mattw/python-functioncalling

rename

mxyng/install

pulse

remove-first

editor

mattw/selfqueryingretrieval

cgo

mattw/howtoquant

api

matt/streamingapi

format-config

mxyng/extra-args

shell

update-nous-hermes

cp-model

upload-progress

fix-unknown-model

fix-model-names

delete-fix

insecure-registry

ls

deletemodels

progressbar

readme-updates

license-layers

skip-list

list-models

modelpath

matt/examplemodelfiles

distribution

go-opts

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: github-starred/ollama#30289