[GH-ISSUE #7130] GPU VRAM Usage Timeout Warnings on Embeddings Model Load #4529

Open
opened 2026-04-12 15:28:00 -05:00 by GiteaMirror · 26 comments

Originally created by @maxruby on GitHub (Oct 8, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7130

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Description:
We are experiencing repeated GPU VRAM recovery timeouts while running multiple models on the ollama platform. The GPU in use is 2x NVIDIA RTX A5000. The system logs show that the VRAM usage does not recover within the expected timeout (5+ seconds), which affects performance and stability.

The issue occurs when loading and running embedding models, particularly when switching between different models. Below is an excerpt of the log showing the repeated warnings and the affected models:

Okt 08 12:26:37 Aerion3 ollama[104243]: llama_model_loader: - type  f32:  243 tensors
Okt 08 12:26:37 Aerion3 ollama[104243]: llama_model_loader: - type  f16:  146 tensors
Okt 08 12:31:41 Aerion3 ollama[104243]: time=2024-10-08T12:31:41.710+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.167422277 model=/usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3
Okt 08 12:31:41 Aerion3 ollama[104243]: time=2024-10-08T12:31:41.959+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.417004589 model=/usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3
Okt 08 12:31:46 Aerion3 ollama[104243]: time=2024-10-08T12:31:46.768+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.057837537 model=/usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
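
For context, this warning is emitted by ollama's scheduler (source=sched.go in the log lines above), which appears to wait for free VRAM to climb back toward its pre-load level after a runner stops and warns when that takes longer than about five seconds. The snippet below is only a minimal Go sketch of that kind of check, not the actual ollama code; freeVRAM is a hypothetical stand-in for the real GPU memory query.

```go
package main

import (
	"log/slog"
	"time"
)

// freeVRAM is a hypothetical stand-in for whatever the server uses to query
// free GPU memory (e.g. NVML); here it just returns a dummy value.
func freeVRAM(gpuID string) uint64 {
	_ = gpuID
	return 0
}

// waitForVRAMRecovery polls until free VRAM climbs back to the pre-load
// baseline, and logs a warning once the timeout is exceeded.
func waitForVRAMRecovery(gpuID, model string, baseline uint64, timeout time.Duration) {
	start := time.Now()
	ticker := time.NewTicker(250 * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		if freeVRAM(gpuID) >= baseline {
			return // memory was released within the window
		}
		if elapsed := time.Since(start); elapsed > timeout {
			slog.Warn("gpu VRAM usage didn't recover within timeout",
				"seconds", elapsed.Seconds(), "model", model)
			return
		}
	}
}

func main() {
	// Example: allow ~5s for VRAM to return to its pre-load level.
	waitForVRAMRecovery("GPU-0", "example-embedding-model", 24<<30, 5*time.Second)
}
```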

Possible Causes under consideration:

  • Insufficient VRAM: The GPU may not have enough VRAM to efficiently load and unload multiple models, leading to delays in VRAM recovery. This seems unlikely, because nvtop never shows GPU consumption above 4% when the warning appears (see the cross-check sketch after this list).
  • Memory Fragmentation: Fragmented memory in the VRAM might be causing issues when trying to allocate new contiguous memory.
  • GPU Overload: The workload may be too heavy for the GPU, especially if multiple models are loaded simultaneously.
  • CUDA Memory Management: Inefficient management of CUDA memory offloading may be causing this issue.
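
To check the first hypothesis (insufficient VRAM) more precisely than with nvtop, it may help to sample per-GPU memory usage while the warnings appear. Below is a minimal Go sketch that shells out to nvidia-smi, assuming it is on the PATH:

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
)

func main() {
	// Query used and total memory per GPU in MiB, one CSV line per device.
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=index,memory.used,memory.total",
		"--format=csv,noheader,nounits").Output()
	if err != nil {
		log.Fatalf("nvidia-smi failed: %v", err)
	}
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		// Each line looks like: "0, 812, 24564" (index, used MiB, total MiB).
		fmt.Println("gpu memory sample:", line)
	}
}
```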

System Information:

  • GPU: 2x NVIDIA RTX A5000
  • ollama Version: 0.3.12
  • Models in Use: jina-embeddings-v2-base-en:latest, mxbai-embed-large-v1, and other models
  • VRAM Available: ~24 GiB x2

Steps to Reproduce:

  • Load and run multiple models in parallel or sequentially.
  • Monitor system logs for VRAM recovery warnings as models are switched or loaded (a small log-watcher sketch follows).
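
As an aid for the second step, a small hypothetical log watcher can follow the journal and surface only the recovery warnings. This sketch assumes ollama runs as the systemd unit "ollama", as in the logs above:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os/exec"
	"strings"
)

func main() {
	// Follow the ollama unit's journal and print only the VRAM warnings.
	cmd := exec.Command("journalctl", "-u", "ollama", "-f", "-o", "cat")
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	scanner := bufio.NewScanner(stdout)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "gpu VRAM usage didn't recover within timeout") {
			fmt.Println("VRAM recovery warning:", line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```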

Expected Behavior:
The system should manage VRAM more efficiently, releasing it within the timeout to avoid warnings and improve overall performance.

Request:
Please investigate possible improvements to VRAM memory management or provide guidance on how to better configure the system to avoid these timeouts.
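
One configuration-side workaround that may be worth trying (an assumption, not a confirmed fix): pass a longer keep_alive with embedding requests so the models stay resident instead of being repeatedly loaded and unloaded. keep_alive is part of ollama's REST API; whether it avoids these particular warnings is untested here. A minimal Go sketch:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Ask the embedding runner to stay loaded for 30 minutes after the last request.
	body, _ := json.Marshal(map[string]any{
		"model":      "mxbai-embed-large-v1",
		"input":      "hello world",
		"keep_alive": "30m",
	})
	resp, err := http.Post("http://localhost:11434/api/embed", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```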

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.3.12

GiteaMirror added the memory and bug labels 2026-04-12 15:28:00 -05:00

@rick-github commented on GitHub (Oct 8, 2024):

Please post complete logs; it makes debugging much easier.


@maxruby commented on GitHub (Oct 8, 2024):

@rick-github
Thanks for the quick reply. Of course, here is a log for one session which shows the warning:

Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.465+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3 gpu=GPU-17536158-810f-656a-7572-2efa837e5d0c parallel=1 available=25198395392 required="754.6 MiB"
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.465+02:00 level=INFO source=server.go:103 msg="system memory" total="251.5 GiB" free="220.0 GiB" free_swap="2.0 GiB"
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.465+02:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=13 layers.offload=13 layers.split="" memory.available="[23.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="754.6 MiB" memory.required.partial="754.6 MiB" memory.required.kv="6.0 MiB" memory.required.allocations="[754.6 MiB]" memory.weights.total="222.3 MiB" memory.weights.repeating="177.6 MiB" memory.weights.nonrepeating="44.7 MiB" memory.graph.full="12.0 MiB" memory.graph.partial="12.0 MiB"
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.468+02:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama1067081739/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --parallel 1 --port 36121"
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.469+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.469+02:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.469+02:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
Okt 08 15:39:02 Aerion3 ollama[699549]: INFO [main] build info | build=10 commit="188ed47" tid="136898111410176" timestamp=1728394742
Okt 08 15:39:02 Aerion3 ollama[699549]: INFO [main] system info | n_threads=24 n_threads_batch=24 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="136898111410176" timestamp=1728394742 total_threads=48
Okt 08 15:39:02 Aerion3 ollama[699549]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="47" port="36121" tid="136898111410176" timestamp=1728394742
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: loaded meta data with 24 key-value pairs and 196 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3 (version GGUF V3 (latest))
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   0:                       general.architecture str              = jina-bert-v2
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   1:                               general.name str              = jina-embeddings-v2-base-en
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   2:                   jina-bert-v2.block_count u32              = 12
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   3:                jina-bert-v2.context_length u32              = 8192
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   4:              jina-bert-v2.embedding_length u32              = 768
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   5:           jina-bert-v2.feed_forward_length u32              = 3072
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   6:          jina-bert-v2.attention.head_count u32              = 12
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   7:  jina-bert-v2.attention.layer_norm_epsilon f32              = 0.000000
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   8:                          general.file_type u32              = 1
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv   9:              jina-bert-v2.attention.causal bool             = false
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  10:                  jina-bert-v2.pooling_type u32              = 1
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = bert
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  13:                         tokenizer.ggml.pre str              = jina-v2-en
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,30528]   = ["[PAD]", "[unused0]", "[unused1]", "...
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,30528]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  16:            tokenizer.ggml.unknown_token_id u32              = 100
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  17:          tokenizer.ggml.seperator_token_id u32              = 102
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  19:                tokenizer.ggml.cls_token_id u32              = 101
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  20:               tokenizer.ggml.mask_token_id u32              = 103
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = true
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - kv  23:               general.quantization_version u32              = 2
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - type  f32:  111 tensors
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_model_loader: - type  f16:   85 tensors
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_vocab: special tokens cache size = 5
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_vocab: token to piece cache size = 0.2032 MB
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: format           = GGUF V3 (latest)
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: arch             = jina-bert-v2
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: vocab type       = WPM
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_vocab          = 30528
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_merges         = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: vocab_only       = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_ctx_train      = 8192
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_embd           = 768
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_layer          = 12
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_head           = 12
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_head_kv        = 12
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_rot            = 64
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_swa            = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_head_k    = 64
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_head_v    = 64
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_gqa            = 1
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_k_gqa     = 768
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_v_gqa     = 768
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: f_norm_eps       = 1.0e-12
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: f_max_alibi_bias = 8.0e+00
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_ff             = 3072
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_expert         = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_expert_used    = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: causal attn      = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: pooling type     = 1
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: rope type        = -1
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: rope scaling     = linear
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: freq_base_train  = 10000.0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: freq_scale_train = 1
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: n_ctx_orig_yarn  = 8192
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: rope_finetuned   = unknown
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_conv       = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_inner      = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_state      = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: ssm_dt_rank      = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: ssm_dt_b_c_rms   = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: model type       = 137M
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: model ftype      = F16
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: model params     = 136.78 M
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: model size       = 261.05 MiB (16.01 BPW)
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: general.name     = jina-embeddings-v2-base-en
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: UNK token        = 100 '[UNK]'
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: SEP token        = 102 '[SEP]'
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: PAD token        = 0 '[PAD]'
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: CLS token        = 101 '[CLS]'
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: MASK token       = 103 '[MASK]'
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: LF token         = 0 '[PAD]'
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_print_meta: max token length = 22
Okt 08 15:39:02 Aerion3 ollama[104243]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Okt 08 15:39:02 Aerion3 ollama[104243]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Okt 08 15:39:02 Aerion3 ollama[104243]: ggml_cuda_init: found 1 CUDA devices:
Okt 08 15:39:02 Aerion3 ollama[104243]:   Device 0: NVIDIA RTX A5000, compute capability 8.6, VMM: yes
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors: ggml ctx size =    0.16 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors: offloading 12 repeating layers to GPU
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors: offloading non-repeating layers to GPU
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors: offloaded 13/13 layers to GPU
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors:        CPU buffer size =    44.72 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llm_load_tensors:      CUDA0 buffer size =   216.32 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: n_ctx      = 2048
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: n_batch    = 512
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: n_ubatch   = 512
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: flash_attn = 0
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: freq_base  = 10000.0
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: freq_scale = 1
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_kv_cache_init:      CUDA0 KV buffer size =    72.00 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: KV self size  =   72.00 MiB, K (f16):   36.00 MiB, V (f16):   36.00 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model:      CUDA0 compute buffer size =    20.00 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model:  CUDA_Host compute buffer size =     3.50 MiB
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: graph nodes  = 441
Okt 08 15:39:02 Aerion3 ollama[104243]: llama_new_context_with_model: graph splits = 2
Okt 08 15:39:02 Aerion3 ollama[699549]: INFO [main] model loaded | tid="136898111410176" timestamp=1728394742
Okt 08 15:39:02 Aerion3 ollama[104243]: time=2024-10-08T15:39:02.721+02:00 level=INFO source=server.go:626 msg="llama runner started in 0.25 seconds"
Okt 08 15:39:02 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:39:02 | 200 |  654.483323ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.498+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-17536158-810f-656a-7572-2efa837e5d0c library=cuda total="23.7 GiB" available="22.9 GiB"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.498+02:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-870f99fd-7d9e-432f-685d-4f02a7144fee library=cuda total="23.7 GiB" available="23.4 GiB"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.499+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d gpu=GPU-870f99fd-7d9e-432f-685d-4f02a7144fee parallel=1 available=25119686656 required="1.1 GiB"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.499+02:00 level=INFO source=server.go:103 msg="system memory" total="251.5 GiB" free="219.6 GiB" free_swap="2.0 GiB"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.499+02:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=25 layers.offload=25 layers.split="" memory.available="[23.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="1.1 GiB" memory.required.partial="1.1 GiB" memory.required.kv="3.0 MiB" memory.required.allocations="[1.1 GiB]" memory.weights.total="580.2 MiB" memory.weights.repeating="520.6 MiB" memory.weights.nonrepeating="59.6 MiB" memory.graph.full="8.0 MiB" memory.graph.partial="8.0 MiB"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.502+02:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama1067081739/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d --ctx-size 512 --batch-size 512 --embedding --log-disable --n-gpu-layers 25 --parallel 1 --port 48315"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.502+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=2
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.502+02:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.503+02:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
Okt 08 15:39:03 Aerion3 ollama[699621]: INFO [main] build info | build=10 commit="188ed47" tid="127626911346688" timestamp=1728394743
Okt 08 15:39:03 Aerion3 ollama[699621]: INFO [main] system info | n_threads=24 n_threads_batch=24 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="127626911346688" timestamp=1728394743 total_threads=48
Okt 08 15:39:03 Aerion3 ollama[699621]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="47" port="48315" tid="127626911346688" timestamp=1728394743
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: loaded meta data with 23 key-value pairs and 389 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d (version GGUF V3 (latest))
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   0:                       general.architecture str              = bert
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   1:                               general.name str              = mxbai-embed-large-v1
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   2:                           bert.block_count u32              = 24
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   3:                        bert.context_length u32              = 512
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   4:                      bert.embedding_length u32              = 1024
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 4096
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 16
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f32              = 0.000000
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   8:                          general.file_type u32              = 1
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv   9:                      bert.attention.causal bool             = false
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  10:                          bert.pooling_type u32              = 2
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 101
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 102
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = bert
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 100
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  19:          tokenizer.ggml.seperator_token_id u32              = 102
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  21:                tokenizer.ggml.cls_token_id u32              = 101
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - kv  22:               tokenizer.ggml.mask_token_id u32              = 103
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - type  f32:  243 tensors
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_model_loader: - type  f16:  146 tensors
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_vocab: special tokens cache size = 5
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_vocab: token to piece cache size = 0.2032 MB
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: format           = GGUF V3 (latest)
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: arch             = bert
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: vocab type       = WPM
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_vocab          = 30522
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_merges         = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: vocab_only       = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_ctx_train      = 512
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_embd           = 1024
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_layer          = 24
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_head           = 16
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_head_kv        = 16
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_rot            = 64
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_swa            = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_head_k    = 64
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_head_v    = 64
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_gqa            = 1
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_k_gqa     = 1024
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_embd_v_gqa     = 1024
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: f_norm_eps       = 1.0e-12
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_ff             = 4096
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_expert         = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_expert_used    = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: causal attn      = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: pooling type     = 2
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: rope type        = 2
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: rope scaling     = linear
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: freq_base_train  = 10000.0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: freq_scale_train = 1
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: n_ctx_orig_yarn  = 512
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: rope_finetuned   = unknown
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_conv       = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_inner      = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: ssm_d_state      = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: ssm_dt_rank      = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: ssm_dt_b_c_rms   = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: model type       = 335M
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: model ftype      = F16
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: model params     = 334.09 M
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: model size       = 637.85 MiB (16.02 BPW)
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: general.name     = mxbai-embed-large-v1
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: BOS token        = 101 '[CLS]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: EOS token        = 102 '[SEP]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: UNK token        = 100 '[UNK]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: SEP token        = 102 '[SEP]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: PAD token        = 0 '[PAD]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: CLS token        = 101 '[CLS]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: MASK token       = 103 '[MASK]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: LF token         = 0 '[PAD]'
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_print_meta: max token length = 21
Okt 08 15:39:03 Aerion3 ollama[104243]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Okt 08 15:39:03 Aerion3 ollama[104243]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Okt 08 15:39:03 Aerion3 ollama[104243]: ggml_cuda_init: found 1 CUDA devices:
Okt 08 15:39:03 Aerion3 ollama[104243]:   Device 0: NVIDIA RTX A5000, compute capability 8.6, VMM: yes
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors: ggml ctx size =    0.32 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors: offloading 24 repeating layers to GPU
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors: offloading non-repeating layers to GPU
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors: offloaded 25/25 layers to GPU
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors:        CPU buffer size =    60.62 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llm_load_tensors:      CUDA0 buffer size =   577.23 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: n_ctx      = 512
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: n_batch    = 512
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: n_ubatch   = 512
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: flash_attn = 0
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: freq_base  = 10000.0
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: freq_scale = 1
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_kv_cache_init:      CUDA0 KV buffer size =    48.00 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: KV self size  =   48.00 MiB, K (f16):   24.00 MiB, V (f16):   24.00 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model:      CUDA0 compute buffer size =    25.01 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model:  CUDA_Host compute buffer size =     5.01 MiB
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: graph nodes  = 849
Okt 08 15:39:03 Aerion3 ollama[104243]: llama_new_context_with_model: graph splits = 2
Okt 08 15:39:03 Aerion3 ollama[104243]: time=2024-10-08T15:39:03.754+02:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
Okt 08 15:39:03 Aerion3 ollama[699621]: INFO [main] model loaded | tid="127626911346688" timestamp=1728394743
Okt 08 15:39:04 Aerion3 ollama[104243]: time=2024-10-08T15:39:04.005+02:00 level=INFO source=server.go:626 msg="llama runner started in 0.50 seconds"
Okt 08 15:39:04 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:39:04 | 200 |  779.844982ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:39:57 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:39:57 | 200 |  147.365562ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:39:57 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:39:57 | 200 |  163.893891ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:41:31 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:41:31 | 200 |  159.224843ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:41:31 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:41:31 | 200 |   163.23205ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:42:22 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:42:22 | 200 |  153.931289ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:42:23 Aerion3 ollama[104243]: [GIN] 2024/10/08 - 15:42:23 | 200 |  165.401049ms |      172.18.0.7 | POST     "/api/embed"
Okt 08 15:47:27 Aerion3 ollama[104243]: time=2024-10-08T15:47:27.875+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.167041341 model=/usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3
Okt 08 15:47:28 Aerion3 ollama[104243]: time=2024-10-08T15:47:28.125+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.417041382 model=/usr/share/ollama/.ollama/models/blobs/sha256-03aeef8493ea9a2b8da023e8d21ce77a97e83de66a692417579aa27b717cdaf3
Okt 08 15:47:32 Aerion3 ollama[104243]: time=2024-10-08T15:47:32.960+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.08496612 model=/usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
Okt 08 15:47:33 Aerion3 ollama[104243]: time=2024-10-08T15:47:33.209+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.334297991 model=/usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
Okt 08 15:47:33 Aerion3 ollama[104243]: time=2024-10-08T15:47:33.459+02:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.584203757 model=/usr/share/ollama/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d 

@rick-github commented on GitHub (Oct 8, 2024):

I haven't completely figured this out yet, but it looks like a harmless race condition. ollama creates a go routine to monitor the recovery of VRAM usage, but by the time the routine runs, the model is already unloaded because it's so small. The routine misses this and spends another 5 seconds waiting for the VRAM usage to decrease before giving up with the `didn't recover` message.
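
To make that suspected ordering concrete, here is a minimal, runnable Go sketch of a monitor of this general shape (all names are invented for illustration; this is not the actual sched.go code). If the goroutine only takes its baseline reading after the small model has already been torn down, no further drop can ever be observed and the timeout always fires, even though nothing is actually wrong:

```go
package main

import (
	"fmt"
	"time"
)

// waitForVRAMRecovery is a hypothetical monitor: wait until roughly `estimate`
// bytes of VRAM have shown up as free again, or give up after `timeout`.
// (The real scheduler may use a different threshold; this is only a sketch.)
func waitForVRAMRecovery(estimate uint64, freeVRAM func() uint64, timeout time.Duration) {
	baseline := freeVRAM() // taken too late if the model was already unloaded
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if now := freeVRAM(); now > baseline && now-baseline >= estimate {
			fmt.Println("VRAM recovered")
			return
		}
		time.Sleep(100 * time.Millisecond)
	}
	fmt.Println("gpu VRAM usage didn't recover within timeout")
}

func main() {
	// Simulate a tiny model whose memory was already freed before the monitor
	// started: free VRAM stays constant, so no further increase is ever observed.
	freeVRAM := func() uint64 { return 15 << 30 }
	waitForVRAMRecovery(1<<30, freeVRAM, 2*time.Second) // prints the timeout message
}
```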

@maxruby commented on GitHub (Oct 8, 2024):

> I haven't completely figured this out yet, but it looks like a harmless race condition. ollama creates a go routine to monitor the recovery of VRAM usage, but by the time the routine runs, the model is already unloaded because it's so small. The routine misses this and spends another 5 seconds waiting for the VRAM usage to decrease before giving up with the `didn't recover` message.

Thanks for the work so far. This would seem to be consistent with the logs. I guess this would mean perhaps https://github.com/ollama/ollama/blob/main/server/sched.go needs a fix.

I suppose a potential race condition involving VRAM monitoring could be associated with sched.go since it happens during model unloading, where the goroutine monitoring VRAM waits for memory recovery, potentially missing the model unload event. Might this cause an unnecessary delay?

Recently, there appears to be this change; I'm not sure if it is directly related to the issue here: https://github.com/ollama/ollama/commit/90ca84172c2a98ecfd76eb7e05cd3e33e1dde507#diff-aa405274e6abb3e80a9097185ed34bbf8c98cf3410448ab10cac320f4a7aca4aR197

@dhiltgen commented on GitHub (Oct 8, 2024):

Running with `OLLAMA_DEBUG=1` might yield a bit more detail on where the algorithm is getting confused.

The fact that two or more logs report `gpu VRAM usage didn't recover...` for each model shouldn't happen. Somehow there appear to be multiple unloads in flight for a single model, which is likely why the VRAM detection logic is getting thrown off.

@rick-github commented on GitHub (Oct 8, 2024):

I was able to duplicate this with one model.

```
ollama-1  | time=2024-10-08T20:16:00.487Z level=DEBUG source=sched.go:466 msg="context for request finished"
ollama-1  | time=2024-10-08T20:16:00.487Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d duration=1m0s
ollama-1  | time=2024-10-08T20:16:00.487Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d refCount=0
ollama-1  | [GIN] 2024/10/08 - 20:16:00 | 200 |  704.668383ms |      172.21.0.1 | POST     "/api/embed"
ollama-1  | [GIN] 2024/10/08 - 20:16:09 | 200 |      16.724µs |             ::1 | GET      "/"
ollama-1  | [GIN] 2024/10/08 - 20:16:19 | 200 |      29.061µs |             ::1 | GET      "/"
ollama-1  | [GIN] 2024/10/08 - 20:16:29 | 200 |      17.811µs |             ::1 | GET      "/"
ollama-1  | [GIN] 2024/10/08 - 20:16:39 | 200 |      18.526µs |      172.21.0.3 | HEAD     "/"
ollama-1  | [GIN] 2024/10/08 - 20:16:39 | 200 |      88.317µs |      172.21.0.3 | GET      "/api/ps"
ollama-1  | [GIN] 2024/10/08 - 20:16:39 | 200 |      18.191µs |             ::1 | GET      "/"
ollama-1  | [GIN] 2024/10/08 - 20:16:49 | 200 |      21.965µs |             ::1 | GET      "/"
ollama-1  | [GIN] 2024/10/08 - 20:16:59 | 200 |      17.751µs |             ::1 | GET      "/"
ollama-1  | time=2024-10-08T20:17:00.488Z level=DEBUG source=sched.go:341 msg="timer expired, expiring to unload" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama-1  | time=2024-10-08T20:17:00.488Z level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama-1  | time=2024-10-08T20:17:00.488Z level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama-1  | time=2024-10-08T20:17:00.488Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="62.5 GiB" before.free="60.7 GiB" before.free_swap="130.0 GiB" now.total="62.5 GiB" now.free="60.4 GiB" now.free_swap="130.0 GiB"
ollama-1  | CUDA driver version: 12.2
ollama-1  | time=2024-10-08T20:17:00.547Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-6d37f28b-7662-6da1-7c11-2f02633112f2 name="NVIDIA GeForce RTX 3080 Laptop GPU" overhead="0 B" before.total="15.7 GiB" before.free="15.6 GiB" now.total="15.7 GiB" now.free="14.8 GiB" now.used="992.7 MiB"
ollama-1  | releasing cuda driver library
ollama-1  | time=2024-10-08T20:17:00.547Z level=DEBUG source=server.go:1044 msg="stopping llama server"
ollama-1  | time=2024-10-08T20:17:00.548Z level=DEBUG source=server.go:1050 msg="waiting for llama server to exit"
ollama-1  | time=2024-10-08T20:17:00.573Z level=DEBUG source=server.go:1054 msg="llama server stopped"
ollama-1  | time=2024-10-08T20:17:00.573Z level=DEBUG source=sched.go:380 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama-1  | time=2024-10-08T20:17:00.799Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="62.5 GiB" before.free="60.4 GiB" before.free_swap="130.0 GiB" now.total="62.5 GiB" now.free="60.7 GiB" now.free_swap="130.0 GiB"
ollama-1  | CUDA driver version: 12.2
ollama-1  | time=2024-10-08T20:17:00.847Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-6d37f28b-7662-6da1-7c11-2f02633112f2 name="NVIDIA GeForce RTX 3080 Laptop GPU" overhead="0 B" before.total="15.7 GiB" before.free="14.8 GiB" now.total="15.7 GiB" now.free="15.6 GiB" now.used="164.3 MiB"
ollama-1  | releasing cuda driver library
ollama-1  | time=2024-10-08T20:17:01.048Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="62.5 GiB" before.free="60.7 GiB" before.free_swap="130.0 GiB" now.total="62.5 GiB" now.free="60.7 GiB" now.free_swap="130.0 GiB"
ollama-1  | CUDA driver version: 12.2
ollama-1  | time=2024-10-08T20:17:01.100Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-6d37f28b-7662-6da1-7c11-2f02633112f2 name="NVIDIA GeForce RTX 3080 Laptop GPU" overhead="0 B" before.total="15.7 GiB" before.free="15.6 GiB" now.total="15.7 GiB" now.free="15.6 GiB" now.used="164.3 MiB"
```

At time=2024-10-08T20:17:00.547Z, now.used="992.7 MiB"; the runner is then released, and at time=2024-10-08T20:17:00.847Z, now.used="164.3 MiB", yet ollama continues to wait for the VRAM usage to decrease.
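
(For scale: the drop observed in that log is roughly 992.7 MiB - 164.3 MiB ≈ 828 MiB. If, as discussed in the comments below, the scheduler is waiting for about 80% of its ~1.2 GB size estimate for this model, roughly 960 MB, to come back before it declares success, an 828 MiB drop can never satisfy the check.)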

@dhiltgen commented on GitHub (Oct 8, 2024):

Hmm... @rick-github did you see logs saying "gpu VRAM usage didn't recover" as well that you didn't paste in the log above? I would have expected a "gpu VRAM free memory converged ..." log output though. The scheduler takes a snapshot of memory before starting the actual unload of the process, then checks again for the memory to drop, and your logs seem to show that happening, but I can't tell what the model size was.

- At `2024-10-08T20:17:00.547Z` the GPU has `now.total="15.7 GiB" now.free="14.8 GiB" now.used="992.7 MiB"`
- Then the subprocess is terminated
- At `2024-10-08T20:17:01.100Z` the GPU has `now.total="15.7 GiB" now.free="15.6 GiB" now.used="164.3 MiB"`

@rick-github commented on GitHub (Oct 8, 2024):

```
ollama-1  | time=2024-10-08T20:17:05.548Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.059729676 model=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
```

The model was mxbai-embed-large:latest.

@rick-github commented on GitHub (Oct 8, 2024):

FYI, the above experiment was run on a 3080. I was unable to duplicate it on a 4070:

```
ollama  | time=2024-10-08T21:18:35.554Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d duration=30s
ollama  | time=2024-10-08T21:18:35.554Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d refCount=0
ollama  | [GIN] 2024/10/08 - 21:18:43 | 200 |       24.74µs |       127.0.0.1 | GET      "/"
ollama  | [GIN] 2024/10/08 - 21:18:53 | 200 |      16.985µs |       127.0.0.1 | GET      "/"
ollama  | [GIN] 2024/10/08 - 21:19:03 | 200 |       20.75µs |       127.0.0.1 | GET      "/"
ollama  | time=2024-10-08T21:19:05.555Z level=DEBUG source=sched.go:341 msg="timer expired, expiring to unload" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama  | time=2024-10-08T21:19:05.555Z level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama  | time=2024-10-08T21:19:05.555Z level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama  | time=2024-10-08T21:19:05.555Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="94.0 GiB" before.free="86.7 GiB" before.free_swap="245.8 GiB" now.total="94.0 GiB" now.free="85.8 GiB" now.free_swap="245.8 GiB"
ollama  | CUDA driver version: 12.0
ollama  | time=2024-10-08T21:19:05.611Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-b5d7e56c-4491-8eeb-cb2d-e8d8424e5bb7 name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.7 GiB" before.free="10.5 GiB" now.total="11.7 GiB" now.free="9.4 GiB" now.used="2.4 GiB"
ollama  | releasing cuda driver library
ollama  | time=2024-10-08T21:19:05.611Z level=DEBUG source=server.go:1044 msg="stopping llama server"
ollama  | time=2024-10-08T21:19:05.611Z level=DEBUG source=server.go:1050 msg="waiting for llama server to exit"
ollama  | time=2024-10-08T21:19:05.678Z level=DEBUG source=server.go:1054 msg="llama server stopped"
ollama  | time=2024-10-08T21:19:05.678Z level=DEBUG source=sched.go:380 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama  | time=2024-10-08T21:19:05.862Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="94.0 GiB" before.free="85.8 GiB" before.free_swap="245.8 GiB" now.total="94.0 GiB" now.free="86.7 GiB" now.free_swap="245.8 GiB"
ollama  | CUDA driver version: 12.0
ollama  | time=2024-10-08T21:19:05.919Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-b5d7e56c-4491-8eeb-cb2d-e8d8424e5bb7 name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.7 GiB" before.free="9.4 GiB" now.total="11.7 GiB" now.free="10.5 GiB" now.used="1.2 GiB"
ollama  | releasing cuda driver library
ollama  | time=2024-10-08T21:19:05.919Z level=DEBUG source=sched.go:659 msg="gpu VRAM free memory converged after 0.36 seconds" model=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama  | time=2024-10-08T21:19:05.919Z level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama  | time=2024-10-08T21:19:05.919Z level=DEBUG source=sched.go:308 msg="ignoring unload event with no pending requests"
```

@dhiltgen commented on GitHub (Oct 8, 2024):

I think this is specific to this model and our prediction being too high...

```
% ollama ps                           
NAME                        ID              SIZE      PROCESSOR    UNTIL              
mxbai-embed-large:latest    468836162de7    1.2 GB    100% GPU     4 minutes from now
```

However:

```
% nvidia-smi
Tue Oct  8 22:32:17 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GT 1030         Off |   00000000:01:00.0 Off |                  N/A |
| 35%   36C    P0             N/A /   19W |     716MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     89203      C   ...unners/cuda_v12/ollama_llama_server        712MiB |
+-----------------------------------------------------------------------------------------+
```

We're aiming for ~80% recovery, but that would require ~960M so the unload isn't enough to satisfy the algo.
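
A tiny sketch of that check with these numbers plugged in (hypothetical helper for illustration, not the actual scheduler code):

```go
package main

import "fmt"

// recovered mirrors the check described above: treat the unload as complete once
// at least ~80% of the *estimated* footprint has shown up as free VRAM again.
func recovered(freed, estimated float64) bool {
	return freed >= 0.8*estimated
}

func main() {
	estimated := 1200.0 // roughly the 1.2 GB that ollama ps reports for mxbai-embed-large
	actual := 712.0     // roughly the 712MiB that nvidia-smi attributes to the runner
	fmt.Println(recovered(actual, estimated)) // false: 712 < 960, so the monitor times out
}
```

So as long as the estimate overshoots the real allocation by more than about 20%, the warning fires on every unload of this model.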

@maxruby commented on GitHub (Oct 27, 2024):

Sorry to ask again, but it's been 3 weeks now. Can I help in any way to support this PR?

@vtien commented on GitHub (Jan 16, 2025):

Also facing this issue - any updates?

@claymore666 commented on GitHub (Mar 6, 2025):

Same here: ollama 0.5.13 on Windows, RTX 2070 Super. It happens especially when using larger embedding models (snowflake-arctic-embed2:[137m|latest]); it does not happen with nomic-embed-text. It also seems more likely to happen when processing larger files.

@rick-github commented on GitHub (Mar 6, 2025):

It's just a warning, you can ignore it. If you are seeing issues with snowflake with 0.5.13, it's probably https://github.com/ollama/ollama/issues/9511.

@tjwebb commented on GitHub (Mar 21, 2025):

I am also seeing this issue with gemma3 on an Nvidia L4.

It's not "just a warning" because something is happening to cause the model to be unloaded and all subsequent requests to hang until I reboot ollama.

log output is here: https://gist.github.com/tjwebb/77657b55f62d55c487124bdd887b72e6

@rick-github commented on GitHub (Mar 21, 2025):

```
[GIN] 2025/03/21 - 20:58:28 | 200 |  2.425022178s |      172.18.0.2 | POST     "/api/generate"
time=2025-03-21T21:03:34.123Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.147554459 model=/root/.ollama/models/blobs/sha256-728e7e4ac6e65cd68bf0d6c3ebf2e9944b19d3ad2da49ab53265457f6de1f02c
```

The last request was at 20:58:28, and your keep_alive is 5 minutes. So at around 21:03:28 the model is unloaded, which starts the VRAM usage monitoring. The monitor expects VRAM usage to recover within 5 seconds, so at around 21:03:33 (or 21:03:34.123 from the timestamp, some drift due to overhead and imprecise timestamps) the monitor alerts that GPU VRAM usage hasn't recovered.

If you can set OLLAMA_DEBUG=1 in the environment and post a log, there might be details that indicate what's going on.

@tjwebb commented on GitHub (Mar 21, 2025):

I set `OLLAMA_LOAD_TIMEOUT="60m"` and that seems to have no effect.

I will reproduce shortly and post the full log with debug on.

@rick-github commented on GitHub (Mar 21, 2025):

`OLLAMA_KEEP_ALIVE`

@tjwebb commented on GitHub (Mar 21, 2025):

> So at around 21:03:28 the model is unloaded,

The problem is that I am still making requests; the model should not get unloaded at all. I'm making tons of requests. All my requests begin hanging right at the moment it spits out the unload warning, and it never recovers.

@rick-github commented on GitHub (Mar 21, 2025):

It may be a queueing problem in the server, not a problem with the runner. I see that you have `OLLAMA_NUM_PARALLEL=1` and `OLLAMA_MAX_QUEUE=1000`, and the queries are chronologically close, so I assume your client is dumping a lot of queries into the ollama API and processing the results as they are returned. If the runner is going idle for 5 minutes and the server is unresponsive to the point where you have to restart it, that would indicate a problem in the server.

@tjwebb commented on GitHub (Mar 21, 2025):

Yea I set those other flags while trying to solve this problem. It happens without them also.

I'm sending about 150 queries over the course of an hour. Is that a lot?

@rick-github commented on GitHub (Mar 21, 2025):

> I'm sending about 150 queries over the course of an hour. Is that a lot?

It shouldn't be, I do much more than that for weeks at a time without issues.

@tjwebb commented on GitHub (Mar 21, 2025):

It failed again; the debug logs are HUGE. I'm trying to find a way to upload a 500 MB text file.

@maxruby commented on GitHub (Mar 22, 2025):

> It may be a queueing problem in the server, not a problem with the runner. I see that you have `OLLAMA_NUM_PARALLEL=1` and `OLLAMA_MAX_QUEUE=1000`, and the queries are chronologically close, so I assume your client is dumping a lot of queries into the ollama API and processing the results as they are returned. If the runner is going idle for 5 minutes and the server is unresponsive to the point where you have to restart it, that would indicate a problem in the server.

Since I am the one who first reported this issue in October 2024 and have been watching the progress since then, I must say I don't really understand the insistence on this being "just a warning" by @rick-github. Even if it's "just a warning", it's clear that it is causing confusion and, in some cases, evidently clogging API responses when using some embedding models on ollama servers. In my experience so far, I have encountered the problem reported here both on my own private ollama server (M1 macOS) and at work on an Ubuntu server with 2x NVIDIA RTX A5000s. The question is still why we don't empower others to contribute potential solutions. I offered to help in the past (see https://github.com/ollama/ollama/issues/7130#issuecomment-2440082258), but there was no positive response.

@rick-github commented on GitHub (Mar 22, 2025):

I agree that it's confusing, so let me try to clarify.

First, let's identify why this warning is emitted in the first place. Freeing GPU memory is asynchronous. That is, after ollama has told the GPU to release the memory, there's a delay before that memory shows up in the free list. So when unloading a model, ollama monitors the amount of free memory to see when the newly freed memory becomes available. The monitor is started just before the unload, and runs until either the freed memory (at least 80% of it) shows up in the free list, or the monitor timeout (5s) expires. Note, "monitor" - this has no effect on the actual unloading/loading of models.
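
Putting that description into a small Go sketch may help (hypothetical names, not the actual sched.go code): the monitor is launched alongside the unload and only observes; whether it converges or times out has no effect on the unload itself.

```go
package main

import (
	"log/slog"
	"sync/atomic"
	"time"
)

// unloadWithMonitor sketches the flow described above: snapshot free VRAM,
// kick off an observational monitor, then perform the unload regardless.
func unloadWithMonitor(estimate uint64, gpuFree func() uint64, stopRunner func()) {
	before := gpuFree() // snapshot taken just before the unload starts
	done := make(chan struct{})
	go func() {
		defer close(done)
		deadline := time.Now().Add(5 * time.Second) // the 5s monitor timeout
		for time.Now().Before(deadline) {
			if now := gpuFree(); now > before && now-before >= estimate*8/10 {
				slog.Debug("gpu VRAM free memory converged") // the happy-path message
				return
			}
			time.Sleep(250 * time.Millisecond)
		}
		slog.Warn("gpu VRAM usage didn't recover within timeout") // purely informational
	}()
	stopRunner() // the actual unload proceeds whether or not the monitor converges
	<-done       // only so this example doesn't exit before the monitor finishes
}

func main() {
	var free atomic.Uint64
	free.Store(14 << 30) // pretend 14 GiB is free before the unload
	// The "unload" frees ~900 MiB against a 1 GiB estimate, so the 80% check passes;
	// with a large over-estimate the same run would end in the warning instead.
	unloadWithMonitor(1<<30, free.Load, func() { free.Add(900 << 20) })
}
```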

Now to the specific examples of this warning in this issue. In your original post, the warning was emitted while unloading the jina-embeddings-v2-base-en and mxbai-embed-large models. Daniel investigated and found that the problem is a disconnect between what ollama estimates the memory requirements are going to be, and the amount of memory actually consumed by the GPU backend (llama.cpp) and the driver. Notably, nvidia-smi reports that the actual memory used is 712MB, while ollama estimated 1200MB. (The disconnect between estimation and actual is a whole 'nuther matter, see eg #6160). So the VRAM monitor is expecting at least (1200 * 80%) = 960MB, but the actual amount returned was 712MB. So the test condition in the VRAM monitor is not satisfied and it runs until the timeout.

The PR that Daniel proposed only changed the test condition - there was no change to how the models are loaded/unloaded. The warning in this case is bogus - it does not represent an actual problem (other than the confusion it generates for people who see it in their logs).

The second and third reports I admittedly gave short shrift, based on the results of the first investigation, the implication of small models, and the fact that there were real problems with snowflake at the time. If @vtien and @claymore666 are still having problems, I invite them to add logs and the problem can be diagnosed.

The fourth report from Travis was accompanied by logs which indicated the model involved was not a small one (ie, not a result of the actual/estimated logic failure), so I was able to put together a timeline, a possible cause, and a request for more detailed information. The investigation is on-going but so far it looks like the problem has nothing to do with VRAM usage.

Nobody is disempowering others from contributing potential solutions. Everybody is free to clone the repo, do their own investigation, and submit a PR. If that's not in their skill set, then a comprehensive bug report, preferably with logs and, as a nice-to-have, a code sample that demonstrates the problem, will be gratefully received and given attention. It should be noted that there are only so many man-hours available, so some issues are prioritized over others, but I assure you that all issues and solutions are considered.

@rick-github commented on GitHub (Mar 22, 2025):

Returning to the specific issue of the warning in Travis' report, we see it is in fact another estimate/actual mismatch:

```console
$ ollama run gemma3:12b-it-q8_0 ''
$ curl -s localhost:11434/api/ps | jq '.models[].size_vram / 1024 / 1024 | floor'
11617
$ nvidia-smi -x -q | yq -p=xml -o=json | jq -r '.nvidia_smi_log.gpu.processes.process_info.used_memory'
7448 MiB
```

ollama has estimated ~11G, but actual usage is ~7G: (11617 * 0.8) = 9293.6, which is more than the 7448 MiB that is released when the model is unloaded.

gemma3 estimates are wildly inaccurate, which is only going to get worse for large windows with #9892. The ollama team are aware of this, so I'm hoping that the estimation logic will receive some attention.

Reference: github-starred/ollama#4529