[GH-ISSUE #7130] GPU VRAM Usage Timeout Warnings on Embeddings Model Load #4529

Open
opened 2026-04-12 15:28:00 -05:00 by GiteaMirror · 26 comments

Originally created by @maxruby on GitHub (Oct 8, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7130

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Description:
We are experiencing repeated GPU VRAM recovery timeouts while running multiple models on the ollama platform. The GPU in use is 2x NVIDIA RTX A5000. The system logs show that the VRAM usage does not recover within the expected timeout (5+ seconds), which affects performance and stability.

The issue occurs when loading and running embedding models, particularly when switching between different models. Below is an excerpt of the log showing the repeated warnings and the affected models:

Okt 08 12:26:37 Aerion3 ollama[104243]: llama_model_loader: - type  f32:  243 tensors
Okt 08 12:26:37 Aerion3 ollama[104243]: llama_model_loader: - type  f16:  146 tensors
Okt 08 12:31:41 Aerion3 ollama[104243]: time=2024-10-08T12:31:41.710+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.167422277 model=/usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3
Okt 08 12:31:41 Aerion3 ollama[104243]: time=2024-10-08T12:31:41.959+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.417004589 model=/usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3
Okt 08 12:31:46 Aerion3 ollama[104243]: time=2024-10-08T12:31:46.768+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.057837537 model=/usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
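
For context, this warning is emitted by ollama's scheduler (source=sched.go in the log lines above), which appears to wait for free VRAM to climb back toward its pre-load level after a runner stops and warns when that takes longer than about five seconds. The snippet below is only a minimal Go sketch of that kind of check, not the actual ollama code; freeVRAM is a hypothetical stand-in for the real GPU memory query.

```go
package main

import (
	"log/slog"
	"time"
)

// freeVRAM is a hypothetical stand-in for whatever the server uses to query
// free GPU memory (e.g. NVML); here it just returns a dummy value.
func freeVRAM(gpuID string) uint64 {
	_ = gpuID
	return 0
}

// waitForVRAMRecovery polls until free VRAM climbs back to the pre-load
// baseline, and logs a warning once the timeout is exceeded.
func waitForVRAMRecovery(gpuID, model string, baseline uint64, timeout time.Duration) {
	start := time.Now()
	ticker := time.NewTicker(250 * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		if freeVRAM(gpuID) >= baseline {
			return // memory was released within the window
		}
		if elapsed := time.Since(start); elapsed > timeout {
			slog.Warn("gpu VRAM usage didn't recover within timeout",
				"seconds", elapsed.Seconds(), "model", model)
			return
		}
	}
}

func main() {
	// Example: allow ~5s for VRAM to return to its pre-load level.
	waitForVRAMRecovery("GPU-0", "example-embedding-model", 24<<30, 5*time.Second)
}
```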

Possible Causes under consideration:

  • Insufficient VRAM: The GPU may not have enough VRAM to efficiently load and unload multiple models, leading to delays in VRAM recovery. This seems unlikely, because nvtop never shows GPU consumption above 4% when the warning appears (see the cross-check sketch after this list).
  • Memory Fragmentation: Fragmented memory in the VRAM might be causing issues when trying to allocate new contiguous memory.
  • GPU Overload: The workload may be too heavy for the GPU, especially if multiple models are loaded simultaneously.
  • CUDA Memory Management: Inefficient management of CUDA memory offloading may be causing this issue.
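
To check the first hypothesis (insufficient VRAM) more precisely than with nvtop, it may help to sample per-GPU memory usage while the warnings appear. Below is a minimal Go sketch that shells out to nvidia-smi, assuming it is on the PATH:

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
)

func main() {
	// Query used and total memory per GPU in MiB, one CSV line per device.
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=index,memory.used,memory.total",
		"--format=csv,noheader,nounits").Output()
	if err != nil {
		log.Fatalf("nvidia-smi failed: %v", err)
	}
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		// Each line looks like: "0, 812, 24564" (index, used MiB, total MiB).
		fmt.Println("gpu memory sample:", line)
	}
}
```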

System Information:

  • GPU: 2x NVIDIA RTX A5000
  • ollama Version: 0.3.12
  • Models in Use: jina-embeddings-v2-base-en:latest, mxbai-embed-large-v1, and other models
  • VRAM Available: ~24 GiB x2

Steps to Reproduce:

  • Load and run multiple models in parallel or sequentially.
  • Monitor system logs for VRAM recovery warnings as models are switched or loaded (a small log-watcher sketch follows).
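
As an aid for the second step, a small hypothetical log watcher can follow the journal and surface only the recovery warnings. This sketch assumes ollama runs as the systemd unit "ollama", as in the logs above:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os/exec"
	"strings"
)

func main() {
	// Follow the ollama unit's journal and print only the VRAM warnings.
	cmd := exec.Command("journalctl", "-u", "ollama", "-f", "-o", "cat")
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	scanner := bufio.NewScanner(stdout)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "gpu VRAM usage didn't recover within timeout") {
			fmt.Println("VRAM recovery warning:", line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```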

Expected Behavior:
The system should manage VRAM more efficiently, releasing it within the timeout to avoid warnings and improve overall performance.

Request:
Please investigate possible improvements to VRAM memory management or provide guidance on how to better configure the system to avoid these timeouts.
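
One configuration-side workaround that may be worth trying (an assumption, not a confirmed fix): pass a longer keep_alive with embedding requests so the models stay resident instead of being repeatedly loaded and unloaded. keep_alive is part of ollama's REST API; whether it avoids these particular warnings is untested here. A minimal Go sketch:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Ask the embedding runner to stay loaded for 30 minutes after the last request.
	body, _ := json.Marshal(map[string]any{
		"model":      "mxbai-embed-large-v1",
		"input":      "hello world",
		"keep_alive": "30m",
	})
	resp, err := http.Post("http://localhost:11434/api/embed", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```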

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.3.12

GiteaMirror added the memory and bug labels 2026-04-12 15:28:00 -05:00

@rick-github commented on GitHub (Oct 8, 2024):

Please post complete logs; it makes debugging much easier.


@maxruby commented on GitHub (Oct 8, 2024):

@rick-github
Thanks for the quick reply. Of course, here is a log for one session which shows the warning:

Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.465+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3 gpu=GPU-17536158-810f-656a-7572-2efa837e5d0c parallel=1 available=25198395392 required="754.6 MiB"
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.465+02:00 level=INFO source=server.go:103 msg="system memory" total="251.5 GiB" free="220.0 GiB" free_swap="2.0 GiB"
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.465+02:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=13 layers.offload=13 layers.split="" memory.available="[23.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="754.6 MiB" memory.required.partial="754.6 MiB" memory.required.kv="6.0 MiB" memory.required.allocations="[754.6 MiB]" memory.weights.total="222.3 MiB" memory.weights.repeating="177.6 MiB" memory.weights.nonrepeating="44.7 MiB" memory.graph.full="12.0 MiB" memory.graph.partial="12.0 MiB"
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.468+02:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama1067081739/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --parallel 1 --port 36121"
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.469+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.469+02:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.469+02:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
Okt 08 15:39:02 Aerion3 ollama[699549]: INFO [main] build info | build=10 commit="188ed47" tid="136898111410176" timestamp=1728394742
Okt 08 15:39:02 Aerion3 ollama[699549]: INFO [main] system info | n_threads=24 n_threads_batch=24 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="136898111410176" timestamp=1728394742 total_threads=48
Okt 08 15:39:02 Aerion3 ollama[699549]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="47" port="36121" tid="136898111410176" timestamp=1728394742
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: loaded meta data with 24 key-value pairs and 196 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3 (version GGUF V3 (latest))
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   0:                       general.architecture str              = jina-bert-v2
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   1:                               general.name str              = jina-embeddings-v2-base-en
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   2:                   jina-bert-v2.block_count u32              = 12
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   3:                jina-bert-v2.context_length u32              = 8192
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   4:              jina-bert-v2.embedding_length u32              = 768
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   5:           jina-bert-v2.feed_forward_length u32              = 3072
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   6:          jina-bert-v2.attention.head_count u32              = 12
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   7:  jina-bert-v2.attention.layer_norm_epsilon f32              = 0.000000
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   8:                          general.file_type u32              = 1
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   9:              jina-bert-v2.attention.causal bool             = false
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  10:                  jina-bert-v2.pooling_type u32              = 1
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = bert
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  13:                         tokenizer.ggml.pre str              = jina-v2-en
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,30528]   = ["[PAD]", "[unused0]", "[unused1]", "...
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,30528]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  16:            tokenizer.ggml.unknown_token_id u32              = 100
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  17:          tokenizer.ggml.seperator_token_id u32              = 102
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  19:                tokenizer.ggml.cls_token_id u32              = 101
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  20:               tokenizer.ggml.mask_token_id u32              = 103
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = true
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  23:               general.quantization_version u32              = 2
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - type  f32:  111 tensors
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - type  f16:   85 tensors
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_vocab: special tokens cache size = 5
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_vocab: token to piece cache size = 0.2032 MB
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: format           = GGUF V3 (latest)
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: arch             = jina-bert-v2
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: vocab type       = WPM
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_vocab          = 30528
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_merges         = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: vocab_only       = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_ctx_train      = 8192
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_embd           = 768
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_layer          = 12
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_head           = 12
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_head_kv        = 12
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_rot            = 64
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_swa            = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_head_k    = 64
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_head_v    = 64
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_gqa            = 1
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_k_gqa     = 768
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_v_gqa     = 768
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: f_norm_eps       = 1.0e-12
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: f_max_alibi_bias = 8.0e+00
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_ff             = 3072
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_expert         = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_expert_used    = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: causal attn      = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: pooling type     = 1
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: rope type        = -1
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: rope scaling     = linear
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: freq_base_train  = 10000.0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: freq_scale_train = 1
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_ctx_orig_yarn  = 8192
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: rope_finetuned   = unknown
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_conv       = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_inner      = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_state      = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: ssm_dt_rank      = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: ssm_dt_b_c_rms   = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: model type       = 137M
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: model ftype      = F16
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: model params     = 136.78 M
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: model size       = 261.05 MiB (16.01 BPW)
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: general.name     = jina-embeddings-v2-base-en
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: UNK token        = 100 '[UNK]'
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: SEP token        = 102 '[SEP]'
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: PAD token        = 0 '[PAD]'
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: CLS token        = 101 '[CLS]'
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: MASK token       = 103 '[MASK]'
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: LF token         = 0 '[PAD]'
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: max token length = 22
Okt 08 15:39:02 Aerion3 ollama[104243]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Okt 08 15:39:02 Aerion3 ollama[104243]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Okt 08 15:39:02 Aerion3 ollama[104243]: ggml_cuda_init: found 1 CUDA devices:
Okt 08 15:39:02 Aerion3 ollama[104243]:   Device 0: NVIDIA RTX A5000, compute capability 8.6, VMM: yes
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors: ggml ctx size =    0.16 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors: offloading 12 repeating layers to GPU
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors: offloading non-repeating layers to GPU
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors: offloaded 13/13 layers to GPU
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors:        CPU buffer size =    44.72 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors:      CUDA0 buffer size =   216.32 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: n_ctx      = 2048
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: n_batch    = 512
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: n_ubatch   = 512
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: flash_attn = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: freq_base  = 10000.0
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: freq_scale = 1
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_kv_cache_init:      CUDA0 KV buffer size =    72.00 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: KV self size  =   72.00 MiB, K (f16):   36.00 MiB, V (f16):   36.00 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model:      CUDA0 compute buffer size =    20.00 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model:  CUDA_Host compute buffer size =     3.50 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: graph nodes  = 441
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: graph splits = 2
Okt 08 15:39:02 Aerion3 ollama[699549]: INFO [main] model loaded | tid="136898111410176" timestamp=1728394742
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.721+02:00 level=INFO source=server.go:626 msg="llama runner started in 0.25 seconds"
Okt 08 15:39:02 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:39:02 | 200 |  654.483323ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.498+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-17536158-810f-656a-7572-2efa837e5d0c library=cuda total="23.7 GiB" available="22.9 GiB"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.498+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-870f99fd-7d9e-432f-685d-4f02a7144fee library=cuda total="23.7 GiB" available="23.4 GiB"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.499+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d gpu=GPU-870f99fd-7d9e-432f-685d-4f02a7144fee parallel=1 available=25119686656 required="1.1 GiB"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.499+02:00 level=INFO source=server.go:103 msg="system memory" total="251.5 GiB" free="219.6 GiB" free_swap="2.0 GiB"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.499+02:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=25 layers.offload=25 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="1.1 GiB" memory.required.partial="1.1 GiB" memory.required.kv="3.0 MiB" memory.required.allocations="[1.1 GiB]" memory.weights.total="580.2 MiB" memory.weights.repeating="520.6 MiB" memory.weights.nonrepeating="59.6 MiB" memory.graph.full="8.0 MiB" memory.graph.partial="8.0 MiB"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.502+02:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama1067081739/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d --ctx-size 512 --batch-size 512 --embedding --log-disable --n-gpu-layers 25 --parallel 1 --port 48315"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.502+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=2
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.502+02:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.503+02:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
Okt 08 15:39:03 Aerion3 ollama[699621]: INFO [main] build info | build=10 commit="188ed47" tid="127626911346688" timestamp=1728394743
Okt 08 15:39:03 Aerion3 ollama[699621]: INFO [main] system info | n_threads=24 n_threads_batch=24 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="127626911346688" timestamp=1728394743 total_threads=48
Okt 08 15:39:03 Aerion3 ollama[699621]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="47" port="48315" tid="127626911346688" timestamp=1728394743
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: loaded meta data with 23 key-value pairs and 389 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d (version GGUF V3 (latest))
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   0:                       general.architecture str              = bert
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   1:                               general.name str              = mxbai-embed-large-v1
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   2:                           bert.block_count u32              = 24
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   3:                        bert.context_length u32              = 512
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   4:                      bert.embedding_length u32              = 1024
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 4096
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 16
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f32              = 0.000000
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   8:                          general.file_type u32              = 1
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   9:                      bert.attention.causal bool             = false
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  10:                          bert.pooling_type u32              = 2
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 101
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 102
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = bert
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 100
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  19:          tokenizer.ggml.seperator_token_id u32              = 102
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  21:                tokenizer.ggml.cls_token_id u32              = 101
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  22:               tokenizer.ggml.mask_token_id u32              = 103
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - type  f32:  243 tensors
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - type  f16:  146 tensors
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_vocab: special tokens cache size = 5
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_vocab: token to piece cache size = 0.2032 MB
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: format           = GGUF V3 (latest)
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: arch             = bert
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: vocab type       = WPM
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_vocab          = 30522
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_merges         = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: vocab_only       = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_ctx_train      = 512
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_embd           = 1024
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_layer          = 24
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_head           = 16
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_head_kv        = 16
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_rot            = 64
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_swa            = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_head_k    = 64
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_head_v    = 64
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_gqa            = 1
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_k_gqa     = 1024
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_v_gqa     = 1024
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: f_norm_eps       = 1.0e-12
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_ff             = 4096
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_expert         = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_expert_used    = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: causal attn      = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: pooling type     = 2
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: rope type        = 2
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: rope scaling     = linear
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: freq_base_train  = 10000.0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: freq_scale_train = 1
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_ctx_orig_yarn  = 512
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: rope_finetuned   = unknown
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_conv       = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_inner      = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_state      = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: ssm_dt_rank      = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: ssm_dt_b_c_rms   = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: model type       = 335M
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: model ftype      = F16
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: model params     = 334.09 M
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: model size       = 637.85 MiB (16.02 BPW)
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: general.name     = mxbai-embed-large-v1
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: BOS token        = 101 '[CLS]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: EOS token        = 102 '[SEP]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: UNK token        = 100 '[UNK]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: SEP token        = 102 '[SEP]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: PAD token        = 0 '[PAD]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: CLS token        = 101 '[CLS]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: MASK token       = 103 '[MASK]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: LF token         = 0 '[PAD]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: max token length = 21
Okt 08 15:39:03 Aerion3 ollama[104243]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Okt 08 15:39:03 Aerion3 ollama[104243]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Okt 08 15:39:03 Aerion3 ollama[104243]: ggml_cuda_init: found 1 CUDA devices:
Okt 08 15:39:03 Aerion3 ollama[104243]:   Device 0: NVIDIA RTX A5000, compute capability 8.6, VMM: yes
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors: ggml ctx size =    0.32 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors: offloading 24 repeating layers to GPU
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors: offloading non-repeating layers to GPU
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors: offloaded 25/25 layers to GPU
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors:        CPU buffer size =    60.62 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors:      CUDA0 buffer size =   577.23 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: n_ctx      = 512
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: n_batch    = 512
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: n_ubatch   = 512
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: flash_attn = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: freq_base  = 10000.0
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: freq_scale = 1
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_kv_cache_init:      CUDA0 KV buffer size =    48.00 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: KV self size  =   48.00 MiB, K (f16):   24.00 MiB, V (f16):   24.00 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model:      CUDA0 compute buffer size =    25.01 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model:  CUDA_Host compute buffer size =     5.01 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: graph nodes  = 849
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: graph splits = 2
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.754+02:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
Okt 08 15:39:03 Aerion3 ollama[699621]: INFO [main] model loaded | tid="127626911346688" timestamp=1728394743
Okt 08 15:39:04 Aerion3 ollama[104243]: time=2024-10-08T15:39:04.005+02:00 level=INFO source=server.go:626 msg="llama runner started in 0.50 seconds"
Okt 08 15:39:04 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:39:04 | 200 |  779.844982ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:39:57 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:39:57 | 200 |  147.365562ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:39:57 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:39:57 | 200 |  163.893891ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:41:31 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:41:31 | 200 |  159.224843ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:41:31 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:41:31 | 200 |   163.23205ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:42:22 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:42:22 | 200 |  153.931289ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:42:23 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:42:23 | 200 |  165.401049ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:47:27 Aerion3 ollama[104243]: time=2024-10-08T15:47:27.875+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.167041341 model=/usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3
Okt 08 15:47:28 Aerion3 ollama[104243]: time=2024-10-08T15:47:28.125+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.417041382 model=/usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3
Okt 08 15:47:32 Aerion3 ollama[104243]: time=2024-10-08T15:47:32.960+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.08496612 model=/usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
Okt 08 15:47:33 Aerion3 ollama[104243]: time=2024-10-08T15:47:33.209+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.334297991 model=/usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
Okt 08 15:47:33 Aerion3 ollama[104243]: time=2024-10-08T15:47:33.459+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.584203757 model=/usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d 

@rick-github commented on GitHub (Oct 8, 2024):

I haven't completely figured this out yet, but it looks like a harmless race condition. ollama creates a go routine to monitor the recovery of VRAM usage, but by the time the routine runs, the model is already unloaded because it's so small. The routine misses this and spends another 5 seconds waiting for the VRAM usage to decrease before giving up with the `didn't recover` message.
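
To make that suspected ordering concrete, here is a minimal, runnable Go sketch of a monitor of this general shape (all names are invented for illustration; this is not the actual sched.go code). If the goroutine only takes its baseline reading after the small model has already been torn down, no further drop can ever be observed and the timeout always fires, even though nothing is actually wrong:

```go
package main

import (
	"fmt"
	"time"
)

// waitForVRAMRecovery is a hypothetical monitor: wait until roughly `estimate`
// bytes of VRAM have shown up as free again, or give up after `timeout`.
// (The real scheduler may use a different threshold; this is only a sketch.)
func waitForVRAMRecovery(estimate uint64, freeVRAM func() uint64, timeout time.Duration) {
	baseline := freeVRAM() // taken too late if the model was already unloaded
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if now := freeVRAM(); now > baseline && now-baseline >= estimate {
			fmt.Println("VRAM recovered")
			return
		}
		time.Sleep(100 * time.Millisecond)
	}
	fmt.Println("gpu VRAM usage didn't recover within timeout")
}

func main() {
	// Simulate a tiny model whose memory was already freed before the monitor
	// started: free VRAM stays constant, so no further increase is ever observed.
	freeVRAM := func() uint64 { return 15 << 30 }
	waitForVRAMRecovery(1<<30, freeVRAM, 2*time.Second) // prints the timeout message
}
```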

@maxruby commented on GitHub (Oct 8, 2024):

> I haven't completely figured this out yet, but it looks like a harmless race condition. ollama creates a go routine to monitor the recovery of VRAM usage, but by the time the routine runs, the model is already unloaded because it's so small. The routine misses this and spends another 5 seconds waiting for the VRAM usage to decrease before giving up with the `didn't recover` message.

Thanks for the work so far. This would seem to be consistent with the logs. I guess this would mean perhaps https://github.com/ollama/ollama/blob/main/server/sched.go needs a fix.

I suppose a potential race condition involving VRAM monitoring could be associated with sched.go since it happens during model unloading, where the goroutine monitoring VRAM waits for memory recovery, potentially missing the model unload event. Might this cause an unnecessary delay?

Recently, there appears to be this change; I'm not sure if it is directly related to the issue here: https://github.com/ollama/ollama/commit/90ca84172c2a98ecfd76eb7e05cd3e33e1dde507#diff-aa405274e6abb3e80a9097185ed34bbf8c98cf3410448ab10cac320f4a7aca4aR197

@dhiltgen commented on GitHub (Oct 8, 2024):

Running with `OLLAMA_DEBUG=1` might yield a bit more detail on where the algorithm is getting confused.

The fact that two or more logs report `gpu VRAM usage didn't recover...` for each model shouldn't happen. Somehow there appear to be multiple unloads in flight for a single model, which is likely why the VRAM detection logic is getting thrown off.

@rick-github commented on GitHub (Oct 8, 2024):

I was able to duplicate this with one model.

```
ollama-1  | time=2024-10-08T20:16:00.487Z level=DEBUG source=sched.go:466 msg="context for request finished"
ollama-1  | time=2024-10-08T20:16:00.487Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d duration=1m0s
ollama-1  | time=2024-10-08T20:16:00.487Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d refCount=0
ollama-1  | [GIN] 2024/10/08 - 20:16:00 | 200 |  704.668383ms |      172.21.0.1 | POST     "/api/embed"
ollama-1  | [GIN] 2024/10/08 - 20:16:09 | 200 |      16.724µs |             ::1 | GET      "/"
ollama-1  | [GIN] 2024/10/08 - 20:16:19 | 200 |      29.061µs |             ::1 | GET      "/"
ollama-1  | [GIN] 2024/10/08 - 20:16:29 | 200 |      17.811µs |             ::1 | GET      "/"
ollama-1  | [GIN] 2024/10/08 - 20:16:39 | 200 |      18.526µs |      172.21.0.3 | HEAD     "/"
ollama-1  | [GIN] 2024/10/08 - 20:16:39 | 200 |      88.317µs |      172.21.0.3 | GET      "/api/ps"
ollama-1  | [GIN] 2024/10/08 - 20:16:39 | 200 |      18.191µs |             ::1 | GET      "/"
ollama-1  | [GIN] 2024/10/08 - 20:16:49 | 200 |      21.965µs |             ::1 | GET      "/"
ollama-1  | [GIN] 2024/10/08 - 20:16:59 | 200 |      17.751µs |             ::1 | GET      "/"
ollama-1  | time=2024-10-08T20:17:00.488Z level=DEBUG source=sched.go:341 msg="timer expired, expiring to unload" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama-1  | time=2024-10-08T20:17:00.488Z level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama-1  | time=2024-10-08T20:17:00.488Z level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama-1  | time=2024-10-08T20:17:00.488Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="62.5 GiB" before.free="60.7 GiB" before.free_swap="130.0 GiB" now.total="62.5 GiB" now.free="60.4 GiB" now.free_swap="130.0 GiB"
ollama-1  | CUDA driver version: 12.2
ollama-1  | time=2024-10-08T20:17:00.547Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-6d37f28b-7662-6da1-7c11-2f02633112f2 name="NVIDIA GeForce RTX 3080 Laptop GPU" overhead="0 B" before.total="15.7 GiB" before.free="15.6 GiB" now.total="15.7 GiB" now.free="14.8 GiB" now.used="992.7 MiB"
ollama-1  | releasing cuda driver library
ollama-1  | time=2024-10-08T20:17:00.547Z level=DEBUG source=server.go:1044 msg="stopping llama server"
ollama-1  | time=2024-10-08T20:17:00.548Z level=DEBUG source=server.go:1050 msg="waiting for llama server to exit"
ollama-1  | time=2024-10-08T20:17:00.573Z level=DEBUG source=server.go:1054 msg="llama server stopped"
ollama-1  | time=2024-10-08T20:17:00.573Z level=DEBUG source=sched.go:380 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama-1  | time=2024-10-08T20:17:00.799Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="62.5 GiB" before.free="60.4 GiB" before.free_swap="130.0 GiB" now.total="62.5 GiB" now.free="60.7 GiB" now.free_swap="130.0 GiB"
ollama-1  | CUDA driver version: 12.2
ollama-1  | time=2024-10-08T20:17:00.847Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-6d37f28b-7662-6da1-7c11-2f02633112f2 name="NVIDIA GeForce RTX 3080 Laptop GPU" overhead="0 B" before.total="15.7 GiB" before.free="14.8 GiB" now.total="15.7 GiB" now.free="15.6 GiB" now.used="164.3 MiB"
ollama-1  | releasing cuda driver library
ollama-1  | time=2024-10-08T20:17:01.048Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="62.5 GiB" before.free="60.7 GiB" before.free_swap="130.0 GiB" now.total="62.5 GiB" now.free="60.7 GiB" now.free_swap="130.0 GiB"
ollama-1  | CUDA driver version: 12.2
ollama-1  | time=2024-10-08T20:17:01.100Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-6d37f28b-7662-6da1-7c11-2f02633112f2 name="NVIDIA GeForce RTX 3080 Laptop GPU" overhead="0 B" before.total="15.7 GiB" before.free="15.6 GiB" now.total="15.7 GiB" now.free="15.6 GiB" now.used="164.3 MiB"
```

At time=2024-10-08T20:17:00.547Z, now.used="992.7 MiB"; the runner is then released, and at time=2024-10-08T20:17:00.847Z, now.used="164.3 MiB", yet ollama continues to wait for the VRAM usage to decrease.
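
(For scale: the drop observed in that log is roughly 992.7 MiB - 164.3 MiB ≈ 828 MiB. If, as discussed in the comments below, the scheduler is waiting for about 80% of its ~1.2 GB size estimate for this model, roughly 960 MB, to come back before it declares success, an 828 MiB drop can never satisfy the check.)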

@dhiltgen commented on GitHub (Oct 8, 2024):

Hmm... @rick-github did you see logs saying "gpu VRAM usage didn't recover" as well that you didn't paste in the log above? I would have expected a "gpu VRAM free memory converged ..." log output though. The scheduler takes a snapshot of memory before starting the actual unload of the process, then checks again for the memory to drop, and your logs seem to show that happening, but I can't tell what the model size was.

- At `2024-10-08T20:17:00.547Z` the GPU has `now.total="15.7 GiB" now.free="14.8 GiB" now.used="992.7 MiB"`
- Then the subprocess is terminated
- At `2024-10-08T20:17:01.100Z` the GPU has `now.total="15.7 GiB" now.free="15.6 GiB" now.used="164.3 MiB"`

@rick-github commented on GitHub (Oct 8, 2024):

```
ollama-1  | time=2024-10-08T20:17:05.548Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.059729676 model=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
```

The model was mxbai-embed-large:latest.

@rick-github commented on GitHub (Oct 8, 2024):

FYI, the above experiment was run on a 3080. I was unable to duplicate it on a 4070:

```
ollama  | time=2024-10-08T21:18:35.554Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d duration=30s
ollama  | time=2024-10-08T21:18:35.554Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d refCount=0
ollama  | [GIN] 2024/10/08 - 21:18:43 | 200 |       24.74µs |       127.0.0.1 | GET      "/"
ollama  | [GIN] 2024/10/08 - 21:18:53 | 200 |      16.985µs |       127.0.0.1 | GET      "/"
ollama  | [GIN] 2024/10/08 - 21:19:03 | 200 |       20.75µs |       127.0.0.1 | GET      "/"
ollama  | time=2024-10-08T21:19:05.555Z level=DEBUG source=sched.go:341 msg="timer expired, expiring to unload" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama  | time=2024-10-08T21:19:05.555Z level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama  | time=2024-10-08T21:19:05.555Z level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama  | time=2024-10-08T21:19:05.555Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="94.0 GiB" before.free="86.7 GiB" before.free_swap="245.8 GiB" now.total="94.0 GiB" now.free="85.8 GiB" now.free_swap="245.8 GiB"
ollama  | CUDA driver version: 12.0
ollama  | time=2024-10-08T21:19:05.611Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-b5d7e56c-4491-8eeb-cb2d-e8d8424e5bb7 name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.7 GiB" before.free="10.5 GiB" now.total="11.7 GiB" now.free="9.4 GiB" now.used="2.4 GiB"
ollama  | releasing cuda driver library
ollama  | time=2024-10-08T21:19:05.611Z level=DEBUG source=server.go:1044 msg="stopping llama server"
ollama  | time=2024-10-08T21:19:05.611Z level=DEBUG source=server.go:1050 msg="waiting for llama server to exit"
ollama  | time=2024-10-08T21:19:05.678Z level=DEBUG source=server.go:1054 msg="llama server stopped"
ollama  | time=2024-10-08T21:19:05.678Z level=DEBUG source=sched.go:380 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama  | time=2024-10-08T21:19:05.862Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="94.0 GiB" before.free="85.8 GiB" before.free_swap="245.8 GiB" now.total="94.0 GiB" now.free="86.7 GiB" now.free_swap="245.8 GiB"
ollama  | CUDA driver version: 12.0
ollama  | time=2024-10-08T21:19:05.919Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-b5d7e56c-4491-8eeb-cb2d-e8d8424e5bb7 name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.7 GiB" before.free="9.4 GiB" now.total="11.7 GiB" now.free="10.5 GiB" now.used="1.2 GiB"
ollama  | releasing cuda driver library
ollama  | time=2024-10-08T21:19:05.919Z level=DEBUG source=sched.go:659 msg="gpu VRAM free memory converged after 0.36 seconds" model=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama  | time=2024-10-08T21:19:05.919Z level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama  | time=2024-10-08T21:19:05.919Z level=DEBUG source=sched.go:308 msg="ignoring unload event with no pending requests"
```

@dhiltgen commented on GitHub (Oct 8, 2024):

I think this is specific to this model and our prediction being too high...

```
% ollama ps                           
NAME                        ID              SIZE      PROCESSOR    UNTIL              
mxbai-embed-large:latest    468836162de7    1.2 GB    100% GPU     4 minutes from now
```

However:

```
% nvidia-smi
Tue Oct  8 22:32:17 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GT 1030         Off |   00000000:01:00.0 Off |                  N/A |
| 35%   36C    P0             N/A /   19W |     716MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     89203      C   ...unners/cuda_v12/ollama_llama_server        712MiB |
+-----------------------------------------------------------------------------------------+
```

We're aiming for ~80% recovery, but that would require ~960M so the unload isn't enough to satisfy the algo.
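
A tiny sketch of that check with these numbers plugged in (hypothetical helper for illustration, not the actual scheduler code):

```go
package main

import "fmt"

// recovered mirrors the check described above: treat the unload as complete once
// at least ~80% of the *estimated* footprint has shown up as free VRAM again.
func recovered(freed, estimated float64) bool {
	return freed >= 0.8*estimated
}

func main() {
	estimated := 1200.0 // roughly the 1.2 GB that ollama ps reports for mxbai-embed-large
	actual := 712.0     // roughly the 712MiB that nvidia-smi attributes to the runner
	fmt.Println(recovered(actual, estimated)) // false: 712 < 960, so the monitor times out
}
```

So as long as the estimate overshoots the real allocation by more than about 20%, the warning fires on every unload of this model.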

@maxruby commented on GitHub (Oct 27, 2024):

Sorry to ask again, but it's been 3 weeks now. Can I help in any way to support this PR?

@vtien commented on GitHub (Jan 16, 2025):

Also facing this issue - any updates?

@claymore666 commented on GitHub (Mar 6, 2025):

Same here: ollama 0.5.13 on Windows, RTX 2070 Super. It happens especially when using larger embedding models (snowflake-arctic-embed2:[137m|latest]); it does not happen with nomic-embed-text. It also seems more likely to happen when processing larger files.

@rick-github commented on GitHub (Mar 6, 2025):

It's just a warning, you can ignore it. If you are seeing issues with snowflake with 0.5.13, it's probably https://github.com/ollama/ollama/issues/9511.

@tjwebb commented on GitHub (Mar 21, 2025):

I am also seeing this issue with gemma3 on an Nvidia L4.

It's not "just a warning" because something is happening to cause the model to be unloaded and all subsequent requests to hang until I reboot ollama.

log output is here: https://gist.github.com/tjwebb/77657b55f62d55c487124bdd887b72e6

@rick-github commented on GitHub (Mar 21, 2025):

```
[GIN] 2025/03/21 - 20:58:28 | 200 |  2.425022178s |      172.18.0.2 | POST     "/api/generate"
time=2025-03-21T21:03:34.123Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.147554459 model=/root/.ollama/models/blobs/sha256-728e7e4ac6e65cd68bf0d6c3ebf2e9944b19d3ad2da49ab53265457f6de1f02c
```

The last request was at 20:58:28, and your keep_alive is 5 minutes. So at around 21:03:28 the model is unloaded, which starts the VRAM usage monitoring. The monitor expects VRAM usage to recover within 5 seconds, so at around 21:03:33 (or 21:03:34.123 from the timestamp, some drift due to overhead and imprecise timestamps) the monitor alerts that GPU VRAM usage hasn't recovered.

If you can set OLLAMA_DEBUG=1 in the environment and post a log, there might be details that indicate what's going on.

@tjwebb commented on GitHub (Mar 21, 2025):

I set `OLLAMA_LOAD_TIMEOUT="60m"` and that seems to have no effect.

I will reproduce shortly and post the full log with debug on.

@rick-github commented on GitHub (Mar 21, 2025):

`OLLAMA_KEEP_ALIVE`

@tjwebb commented on GitHub (Mar 21, 2025):

> So at around 21:03:28 the model is unloaded,

The problem is that I am still making requests; the model should not get unloaded at all. I'm making tons of requests. All my requests begin hanging right at the moment it spits out the unload warning, and it never recovers.

@rick-github commented on GitHub (Mar 21, 2025):

It may be a queueing problem in the server, not a problem with the runner. I see that you have `OLLAMA_NUM_PARALLEL=1` and `OLLAMA_MAX_QUEUE=1000`, and the queries are chronologically close, so I assume your client is dumping a lot of queries into the ollama API and processing the results as they are returned. If the runner is going idle for 5 minutes and the server is unresponsive to the point where you have to restart it, that would indicate a problem in the server.

@tjwebb commented on GitHub (Mar 21, 2025):

Yea I set those other flags while trying to solve this problem. It happens without them also.

I'm sending about 150 queries over the course of an hour. Is that a lot?

@rick-github commented on GitHub (Mar 21, 2025):

> I'm sending about 150 queries over the course of an hour. Is that a lot?

It shouldn't be, I do much more than that for weeks at a time without issues.

@tjwebb commented on GitHub (Mar 21, 2025):

It failed again; the debug logs are HUGE. I'm trying to find a way to upload a 500 MB text file.

@maxruby commented on GitHub (Mar 22, 2025):

> It may be a queueing problem in the server, not a problem with the runner. I see that you have `OLLAMA_NUM_PARALLEL=1` and `OLLAMA_MAX_QUEUE=1000`, and the queries are chronologically close, so I assume your client is dumping a lot of queries into the ollama API and processing the results as they are returned. If the runner is going idle for 5 minutes and the server is unresponsive to the point where you have to restart it, that would indicate a problem in the server.

Since I am the one who first reported this issue in October 2024 and have been watching the progress since then, I must say I don't really understand the insistence on this being "just a warning" by @rick-github. Even if it's "just a warning", it's clear that it is causing confusion and, in some cases, evidently clogging API responses when using some embedding models on ollama servers. In my experience so far, I have encountered the problem reported here both on my own private ollama server (M1 macOS) and at work on an Ubuntu server with 2x NVIDIA RTX A5000s. The question is still why we don't empower others to contribute potential solutions. I offered to help in the past (see https://github.com/ollama/ollama/issues/7130#issuecomment-2440082258), but there was no positive response.

@rick-github commented on GitHub (Mar 22, 2025):

I agree that it's confusing, so let me try to clarify.

First, let's identify why this warning is emitted in the first place. Freeing GPU memory is asynchronous. That is, after ollama has told the GPU to release the memory, there's a delay before that memory shows up in the free list. So when unloading a model, ollama monitors the amount of free memory to see when the newly freed memory becomes available. The monitor is started just before the unload, and runs until either the freed memory (at least 80% of it) shows up in the free list, or the monitor timeout (5s) expires. Note, "monitor" - this has no effect on the actual unloading/loading of models.
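
Putting that description into a small Go sketch may help (hypothetical names, not the actual sched.go code): the monitor is launched alongside the unload and only observes; whether it converges or times out has no effect on the unload itself.

```go
package main

import (
	"log/slog"
	"sync/atomic"
	"time"
)

// unloadWithMonitor sketches the flow described above: snapshot free VRAM,
// kick off an observational monitor, then perform the unload regardless.
func unloadWithMonitor(estimate uint64, gpuFree func() uint64, stopRunner func()) {
	before := gpuFree() // snapshot taken just before the unload starts
	done := make(chan struct{})
	go func() {
		defer close(done)
		deadline := time.Now().Add(5 * time.Second) // the 5s monitor timeout
		for time.Now().Before(deadline) {
			if now := gpuFree(); now > before && now-before >= estimate*8/10 {
				slog.Debug("gpu VRAM free memory converged") // the happy-path message
				return
			}
			time.Sleep(250 * time.Millisecond)
		}
		slog.Warn("gpu VRAM usage didn't recover within timeout") // purely informational
	}()
	stopRunner() // the actual unload proceeds whether or not the monitor converges
	<-done       // only so this example doesn't exit before the monitor finishes
}

func main() {
	var free atomic.Uint64
	free.Store(14 << 30) // pretend 14 GiB is free before the unload
	// The "unload" frees ~900 MiB against a 1 GiB estimate, so the 80% check passes;
	// with a large over-estimate the same run would end in the warning instead.
	unloadWithMonitor(1<<30, free.Load, func() { free.Add(900 << 20) })
}
```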

Now to the specific examples of this warning in this issue. In your original post, the warning was emitted while unloading the jina-embeddings-v2-base-en and mxbai-embed-large models. Daniel investigated and found that the problem is a disconnect between what ollama estimates the memory requirements are going to be, and the amount of memory actually consumed by the GPU backend (llama.cpp) and the driver. Notably, nvidia-smi reports that the actual memory used is 712MB, while ollama estimated 1200MB. (The disconnect between estimation and actual is a whole 'nuther matter, see eg #6160). So the VRAM monitor is expecting at least (1200 * 80%) = 960MB, but the actual amount returned was 712MB. So the test condition in the VRAM monitor is not satisfied and it runs until the timeout.

The PR that Daniel proposed only changed the test condition - there was no change to how the models are loaded/unloaded. The warning in this case is bogus - it does not represent an actual problem (other than the confusion it generates for people who see it in their logs).

The second and third reports I admittedly gave short shrift, based on the results of the first investigation, the implication of small models, and the fact that there were real problems with snowflake at the time. If @vtien and @claymore666 are still having problems, I invite them to add logs and the problem can be diagnosed.

The fourth report from Travis was accompanied by logs which indicated the model involved was not a small one (ie, not a result of the actual/estimated logic failure), so I was able to put together a timeline, a possible cause, and a request for more detailed information. The investigation is on-going but so far it looks like the problem has nothing to do with VRAM usage.

Nobody is disempowering others from contributing potential solutions. Everybody is free to clone the repo, do their own investigation, and submit a PR. If that's not in their skill set, then a comprehensive bug report, preferably with logs and, as a nice-to-have, a code sample that demonstrates the problem, will be gratefully received and given attention. It should be noted that there are only so many man-hours available, so some issues are prioritized over others, but I assure you that all issues and solutions are considered.

@rick-github commented on GitHub (Mar 22, 2025):

Returning to the specific issue of the warning in Travis' report, we see it is in fact another estimate/actual mismatch:

```console
$ ollama run gemma3:12b-it-q8_0 ''
$ curl -s localhost:11434/api/ps | jq '.models[].size_vram / 1024 / 1024 | floor'
11617
$ nvidia-smi -x -q | yq -p=xml -o=json | jq -r '.nvidia_smi_log.gpu.processes.process_info.used_memory'
7448 MiB
```

ollama has estimated ~11G, but actual usage is ~7G: (11617 * 0.8) = 9293.6, which is more than the 7448 MiB that is released when the model is unloaded.

gemma3 estimates are wildly inaccurate, which is only going to get worse for large windows with #9892. The ollama team are aware of this, so I'm hoping that the estimation logic will receive some attention.

Reference: github-starred/ollama#4529