[GH-ISSUE #5821] Gemma 2 runs too slow #29387

Open
opened 2026-04-22 08:13:55 -05:00 by GiteaMirror · 19 comments

Originally created by @AeneasZhu on GitHub (Jul 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5821

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

After Ollama's upgrade from 0.2.0 to 0.2.7, it runs Gemma 2 9B at very low speed. I don't think the OS is out of VRAM, since Gemma 2 only costs 6.8 GB of VRAM (q4_0) while my laptop has 8 GB. However, other 9B models with q4_0, such as glm4, run very smoothly. Is that a bug in Ollama or in Gemma 2 itself?
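A rough back-of-envelope, with figures read off the server logs further down in this thread (not Ollama's actual scheduler code), shows why a model whose weights fit in VRAM can still spill: the KV cache, compute buffers, and non-repeating weights also need room, so only part of the 42 transformer blocks end up on the GPU.

```python
# Rough back-of-envelope, not Ollama's scheduler code: figures are read off
# the first "offload to cuda" log line later in this thread (q4_K_M run).
available  = 6.9            # GiB free on the RTX 3070 Ti after OS overhead
kv_cache   = 672 / 1024     # GiB, KV cache at ctx-size 2048
graph      = 1.2            # GiB, partial-offload compute buffer
nonrepeat  = 717.8 / 1024   # GiB, embedding/output weights kept with the GPU
per_layer  = 4.6 / 42       # GiB per block: repeating weights / 42 layers

budget = available - kv_cache - graph - nonrepeat
print(f"~{int(budget / per_layer)} of 42 repeating layers fit on the GPU")
# -> ~39, close to the log's "offloaded 40/43 layers to GPU"; the layers
#    left behind run on the CPU, which is what makes generation so slow.
```

Even two or three layers running on the CPU are enough to dominate latency, which matches the multi-minute `/api/chat` responses in the logs below.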

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.2.7

GiteaMirror added the memory, nvidia, bug labels 2026-04-22 08:13:56 -05:00

@rick-github commented on GitHub (Jul 20, 2024):

Server logs would be helpful to diagnose the issue.

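On Windows, the server log being asked for here is written to `%LOCALAPPDATA%\Ollama\server.log` by default, per Ollama's troubleshooting docs. A minimal sketch to pull its tail, assuming a standard per-user install:

```python
# Minimal sketch: print the last 100 lines of the Ollama server log on
# Windows. The path is the documented default; adjust it if Ollama was
# installed somewhere else.
import os

log_path = os.path.expandvars(r"%LOCALAPPDATA%\Ollama\server.log")
with open(log_path, encoding="utf-8", errors="replace") as f:
    tail = f.readlines()[-100:]
print("".join(tail), end="")
```

The last hundred or so lines around a slow request are usually enough to see how many layers were offloaded.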

@AeneasZhu commented on GitHub (Jul 21, 2024):

> Server logs would be helpful to diagnose the issue.

2024/07/21 07:09:47 routes.go:1096: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:D:\\AGI\\ollama_models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR:C:\\Users\\Raven\\AppData\\Local\\Programs\\Ollama\\ollama_runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-07-21T07:09:47.466+08:00 level=INFO source=images.go:778 msg="total blobs: 77"
time=2024-07-21T07:09:47.559+08:00 level=INFO source=images.go:785 msg="total unused blobs removed: 0"
time=2024-07-21T07:09:47.562+08:00 level=INFO source=routes.go:1143 msg="Listening on 127.0.0.1:11434 (version 0.2.7)"
time=2024-07-21T07:09:47.566+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cuda_v11.3 rocm_v6.1 cpu cpu_avx cpu_avx2]"
time=2024-07-21T07:09:47.567+08:00 level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
time=2024-07-21T07:09:48.289+08:00 level=INFO source=gpu.go:287 msg="detected OS VRAM overhead" id=GPU-59be21cf-1a6f-4733-e579-d85deb64d686 library=cuda compute=8.6 driver=12.1 name="NVIDIA GeForce RTX 3070 Ti Laptop GPU" overhead="918.0 MiB"
time=2024-07-21T07:09:48.294+08:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-59be21cf-1a6f-4733-e579-d85deb64d686 library=cuda compute=8.6 driver=12.1 name="NVIDIA GeForce RTX 3070 Ti Laptop GPU" total="8.0 GiB" available="6.9 GiB"
[GIN] 2024/07/21 - 07:10:23 | 200 |      1.1431ms |       127.0.0.1 | HEAD     "/"
time=2024-07-21T07:10:23.868+08:00 level=WARN source=routes.go:817 msg="bad manifest filepath" name=hub/bacx/studybuddy:latest error="open D:\\AGI\\ollama_models\\blobs\\sha256-c65468c33ec86e462ef2a5eff135cbe40b4e7179b72806048034ccc9dd671eb6: The system cannot find the file specified."
[GIN] 2024/07/21 - 07:10:23 | 200 |     59.0374ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2024/07/21 - 07:10:39 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/21 - 07:10:39 | 200 |     41.6721ms |       127.0.0.1 | POST     "/api/show"
time=2024-07-21T07:10:39.312+08:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=43 layers.offload=40 layers.split="" memory.available="[6.9 GiB]" memory.required.full="7.8 GiB" memory.required.partial="6.8 GiB" memory.required.kv="672.0 MiB" memory.required.allocations="[6.8 GiB]" memory.weights.total="5.3 GiB" memory.weights.repeating="4.6 GiB" memory.weights.nonrepeating="717.8 MiB" memory.graph.full="507.0 MiB" memory.graph.partial="1.2 GiB"
time=2024-07-21T07:10:39.325+08:00 level=INFO source=server.go:383 msg="starting llama server" cmd="C:\\Users\\Raven\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model D:\\AGI\\ollama_models\\blobs\\sha256-befd260af00133c21746d65696658a69103b53287fee1a6d544e8f972de05d67 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 40 --no-mmap --parallel 1 --port 49780"
time=2024-07-21T07:10:39.357+08:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-21T07:10:39.359+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-21T07:10:39.360+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3337 commit="a8db2a9c" tid="13488" timestamp=1721517040
INFO [wmain] system info | n_threads=10 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="13488" timestamp=1721517040 total_threads=20
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="49780" tid="13488" timestamp=1721517040
llama_model_loader: loaded meta data with 29 key-value pairs and 464 tensors from D:\AGI\ollama_models\blobs\sha256-befd260af00133c21746d65696658a69103b53287fee1a6d544e8f972de05d67 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.name str              = merged
llama_model_loader: - kv   2:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   3:                    gemma2.embedding_length u32              = 3584
llama_model_loader: - kv   4:                         gemma2.block_count u32              = 42
llama_model_loader: - kv   5:                 gemma2.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                gemma2.attention.head_count u32              = 16
llama_model_loader: - kv   7:             gemma2.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  10:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  13:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  14:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  18:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  20:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  22:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  25:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  27:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  169 tensors
llama_model_loader: - type q4_K:  252 tensors
llama_model_loader: - type q6_K:   43 tensors
llm_load_vocab: special tokens cache size = 364
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 42
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 9B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 9.24 B
llm_load_print_meta: model size       = 5.36 GiB (4.98 BPW) 
llm_load_print_meta: general.name     = merged
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 93
time=2024-07-21T07:10:40.379+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3070 Ti Laptop GPU, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.41 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloaded 40/43 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =  1677.17 MiB
llm_load_tensors:      CUDA0 buffer size =  4529.00 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    32.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   640.00 MiB
llama_new_context_with_model: KV self size  =  672.00 MiB, K (f16):  336.00 MiB, V (f16):  336.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.99 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1224.77 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    16.01 MiB
llama_new_context_with_model: graph nodes  = 1690
llama_new_context_with_model: graph splits = 30
INFO [wmain] model loaded | tid="13488" timestamp=1721517046
time=2024-07-21T07:10:46.468+08:00 level=INFO source=server.go:617 msg="llama runner started in 7.11 seconds"
[GIN] 2024/07/21 - 07:10:46 | 200 |    7.2349256s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/21 - 07:17:43 | 200 |         5m40s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/21 - 07:18:01 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/21 - 07:18:01 | 200 |     85.0611ms |       127.0.0.1 | POST     "/api/show"
time=2024-07-21T07:18:02.142+08:00 level=INFO source=sched.go:495 msg="updated VRAM based on existing loaded models" gpu=GPU-59be21cf-1a6f-4733-e579-d85deb64d686 library=cuda total="8.0 GiB" available="48.8 MiB"
time=2024-07-21T07:18:03.339+08:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=43 layers.offload=42 layers.split="" memory.available="[6.8 GiB]" memory.required.full="6.8 GiB" memory.required.partial="6.1 GiB" memory.required.kv="672.0 MiB" memory.required.allocations="[6.1 GiB]" memory.weights.total="4.4 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="717.8 MiB" memory.graph.full="507.0 MiB" memory.graph.partial="1.2 GiB"
time=2024-07-21T07:18:03.350+08:00 level=INFO source=server.go:383 msg="starting llama server" cmd="C:\\Users\\Raven\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model D:\\AGI\\ollama_models\\blobs\\sha256-ba678f3760a834f86247d0fd1ad0ff6d62ba9b030774d0c1bf1c38835979b2d4 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 42 --no-mmap --parallel 1 --port 50143"
time=2024-07-21T07:18:03.389+08:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-21T07:18:03.389+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-21T07:18:03.400+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3337 commit="a8db2a9c" tid="11612" timestamp=1721517484
INFO [wmain] system info | n_threads=10 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="11612" timestamp=1721517484 total_threads=20
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="50143" tid="11612" timestamp=1721517484
llama_model_loader: loaded meta data with 29 key-value pairs and 464 tensors from D:\AGI\ollama_models\blobs\sha256-ba678f3760a834f86247d0fd1ad0ff6d62ba9b030774d0c1bf1c38835979b2d4 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.name str              = merged
llama_model_loader: - kv   2:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   3:                    gemma2.embedding_length u32              = 3584
llama_model_loader: - kv   4:                         gemma2.block_count u32              = 42
llama_model_loader: - kv   5:                 gemma2.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                gemma2.attention.head_count u32              = 16
llama_model_loader: - kv   7:             gemma2.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  10:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  11:                          general.file_type u32              = 12
llama_model_loader: - kv  12:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  13:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  14:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
time=2024-07-21T07:18:04.948+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  18:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  20:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  22:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  25:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  27:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  169 tensors
llama_model_loader: - type q3_K:  168 tensors
llama_model_loader: - type q4_K:  122 tensors
llama_model_loader: - type q5_K:    4 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 364
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 42
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 9B
llm_load_print_meta: model ftype      = Q3_K - Medium
llm_load_print_meta: model params     = 9.24 B
llm_load_print_meta: model size       = 4.43 GiB (4.12 BPW) 
llm_load_print_meta: general.name     = merged
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 93
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3070 Ti Laptop GPU, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.41 MiB
llm_load_tensors: offloading 42 repeating layers to GPU
llm_load_tensors: offloaded 42/43 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =  1435.56 MiB
llm_load_tensors:      CUDA0 buffer size =  3817.62 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   672.00 MiB
llama_new_context_with_model: KV self size  =  672.00 MiB, K (f16):  336.00 MiB, V (f16):  336.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.99 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1224.77 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    15.01 MiB
llama_new_context_with_model: graph nodes  = 1690
llama_new_context_with_model: graph splits = 4
INFO [wmain] model loaded | tid="11612" timestamp=1721517491
time=2024-07-21T07:18:11.866+08:00 level=INFO source=server.go:617 msg="llama runner started in 8.48 seconds"
[GIN] 2024/07/21 - 07:18:11 | 200 |      9.90228s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/21 - 07:22:14 | 200 |         3m58s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/21 - 07:22:25 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/21 - 07:22:25 | 200 |    136.1339ms |       127.0.0.1 | POST     "/api/show"
time=2024-07-21T07:22:25.318+08:00 level=INFO source=sched.go:495 msg="updated VRAM based on existing loaded models" gpu=GPU-59be21cf-1a6f-4733-e579-d85deb64d686 library=cuda total="8.0 GiB" available="640.9 MiB"
time=2024-07-21T07:22:26.450+08:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=43 layers.offload=41 layers.split="" memory.available="[6.7 GiB]" memory.required.full="7.5 GiB" memory.required.partial="6.7 GiB" memory.required.kv="672.0 MiB" memory.required.allocations="[6.7 GiB]" memory.weights.total="5.0 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="717.8 MiB" memory.graph.full="507.0 MiB" memory.graph.partial="1.2 GiB"
time=2024-07-21T07:22:26.462+08:00 level=INFO source=server.go:383 msg="starting llama server" cmd="C:\\Users\\Raven\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model D:\\AGI\\ollama_models\\blobs\\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --no-mmap --parallel 1 --port 50231"
time=2024-07-21T07:22:26.526+08:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-21T07:22:26.526+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-21T07:22:26.529+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3337 commit="a8db2a9c" tid="9252" timestamp=1721517748
INFO [wmain] system info | n_threads=10 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="9252" timestamp=1721517748 total_threads=20
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="50231" tid="9252" timestamp=1721517748
llama_model_loader: loaded meta data with 29 key-value pairs and 464 tensors from D:\AGI\ollama_models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.name str              = gemma-2-9b-it
llama_model_loader: - kv   2:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   3:                    gemma2.embedding_length u32              = 3584
llama_model_loader: - kv   4:                         gemma2.block_count u32              = 42
llama_model_loader: - kv   5:                 gemma2.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                gemma2.attention.head_count u32              = 16
llama_model_loader: - kv   7:             gemma2.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  10:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  13:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  14:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
time=2024-07-21T07:22:28.117+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  18:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  20:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  22:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  25:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  27:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  169 tensors
llama_model_loader: - type q4_0:  294 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 364
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 42
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 9B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 9.24 B
llm_load_print_meta: model size       = 5.06 GiB (4.71 BPW) 
llm_load_print_meta: general.name     = gemma-2-9b-it
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 93
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3070 Ti Laptop GPU, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.41 MiB
llm_load_tensors: offloading 41 repeating layers to GPU
llm_load_tensors: offloaded 41/43 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =  1541.93 MiB
llm_load_tensors:      CUDA0 buffer size =  4361.05 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    16.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   656.00 MiB
llama_new_context_with_model: KV self size  =  672.00 MiB, K (f16):  336.00 MiB, V (f16):  336.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.99 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1224.77 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    16.01 MiB
llama_new_context_with_model: graph nodes  = 1690
llama_new_context_with_model: graph splits = 17
INFO [wmain] model loaded | tid="9252" timestamp=1721517757
time=2024-07-21T07:22:38.050+08:00 level=INFO source=server.go:617 msg="llama runner started in 11.52 seconds"
[GIN] 2024/07/21 - 07:22:38 | 200 |   12.9035189s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/21 - 07:25:36 | 200 |   59.9902702s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/21 - 07:25:48 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/21 - 07:25:48 | 200 |     62.6399ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/07/21 - 07:26:40 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/21 - 07:26:41 | 200 |    1.3670033s |       127.0.0.1 | DELETE   "/api/delete"
[GIN] 2024/07/21 - 07:26:48 | 200 |            0s |       127.0.0.1 | HEAD     "/"
time=2024-07-21T07:26:48.291+08:00 level=WARN source=routes.go:817 msg="bad manifest filepath" name=hub/bacx/studybuddy:latest error="open D:\\AGI\\ollama_models\\blobs\\sha256-c65468c33ec86e462ef2a5eff135cbe40b4e7179b72806048034ccc9dd671eb6: The system cannot find the file specified."
[GIN] 2024/07/21 - 07:26:48 | 200 |    171.5556ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2024/07/21 - 07:26:59 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/21 - 07:26:59 | 404 |            0s |       127.0.0.1 | POST     "/api/show"
time=2024-07-21T07:27:32.046+08:00 level=INFO source=images.go:1047 msg="request failed: Head \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232702Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=058dc757384f0a42d66693feb1e1e2f95cbe7e4925e98b1e2d4e0331b631abf3\": dial tcp 104.18.8.90:443: i/o timeout"
[GIN] 2024/07/21 - 07:27:32 | 200 |   32.7627094s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/07/21 - 07:27:48 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/21 - 07:27:48 | 404 |            0s |       127.0.0.1 | POST     "/api/show"
time=2024-07-21T07:28:06.136+08:00 level=INFO source=download.go:136 msg="downloading ff1d1fc78170 in 55 100 MB part(s)"
time=2024-07-21T07:28:37.059+08:00 level=INFO source=images.go:1047 msg="request failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout"
time=2024-07-21T07:28:37.059+08:00 level=INFO source=images.go:1047 msg="request failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout"
time=2024-07-21T07:28:37.061+08:00 level=INFO source=download.go:178 msg="ff1d1fc78170 part 2 attempt 0 failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout, retrying in 1s"
time=2024-07-21T07:28:37.062+08:00 level=INFO source=download.go:178 msg="ff1d1fc78170 part 31 attempt 0 failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout, retrying in 1s"
time=2024-07-21T07:28:37.075+08:00 level=INFO source=images.go:1047 msg="request failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout"
time=2024-07-21T07:28:37.075+08:00 level=INFO source=images.go:1047 msg="request failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout"
time=2024-07-21T07:28:37.075+08:00 level=INFO source=images.go:1047 msg="request failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout"
time=2024-07-21T07:28:37.075+08:00 level=INFO source=download.go:178 msg="ff1d1fc78170 part 54 attempt 0 failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout, retrying in 1s"
time=2024-07-21T07:28:37.075+08:00 level=INFO source=download.go:178 msg="ff1d1fc78170 part 38 attempt 0 failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout, retrying in 1s"
time=2024-07-21T07:28:37.075+08:00 level=INFO source=download.go:178 msg="ff1d1fc78170 part 4 attempt 0 failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout, retrying in 1s"
time=2024-07-21T07:28:37.091+08:00 level=INFO source=images.go:1047 msg="request failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout"
time=2024-07-21T07:28:37.091+08:00 level=INFO source=download.go:178 msg="ff1d1fc78170 part 22 attempt 0 failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout, retrying in 1s"
time=2024-07-21T07:28:37.106+08:00 level=INFO source=images.go:1047 msg="request failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout"
time=2024-07-21T07:28:37.106+08:00 level=INFO source=download.go:178 msg="ff1d1fc78170 part 45 attempt 0 failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout, retrying in 1s"
time=2024-07-21T07:28:37.122+08:00 level=INFO source=images.go:1047 msg="request failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout"
time=2024-07-21T07:28:37.122+08:00 level=INFO source=images.go:1047 msg="request failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout"
time=2024-07-21T07:28:37.122+08:00 level=INFO source=download.go:178 msg="ff1d1fc78170 part 1 attempt 0 failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout, retrying in 1s"
time=2024-07-21T07:28:37.122+08:00 level=INFO source=download.go:178 msg="ff1d1fc78170 part 35 attempt 0 failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout, retrying in 1s"
time=2024-07-21T07:28:37.122+08:00 level=INFO source=images.go:1047 msg="request failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout"
time=2024-07-21T07:28:37.122+08:00 level=INFO source=download.go:178 msg="ff1d1fc78170 part 50 attempt 0 failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout, retrying in 1s"
time=2024-07-21T07:28:37.122+08:00 level=INFO source=images.go:1047 msg="request failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout"
time=2024-07-21T07:28:37.122+08:00 level=INFO source=images.go:1047 msg="request failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout"
time=2024-07-21T07:28:37.122+08:00 level=INFO source=download.go:178 msg="ff1d1fc78170 part 24 attempt 0 failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout, retrying in 1s"
time=2024-07-21T07:28:37.122+08:00 level=INFO source=download.go:178 msg="ff1d1fc78170 part 13 attempt 0 failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout, retrying in 1s"
time=2024-07-21T07:28:37.138+08:00 level=INFO source=images.go:1047 msg="request failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout"
time=2024-07-21T07:28:37.138+08:00 level=INFO source=download.go:178 msg="ff1d1fc78170 part 10 attempt 0 failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout, retrying in 1s"
time=2024-07-21T07:28:37.196+08:00 level=INFO source=images.go:1047 msg="request failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout"
time=2024-07-21T07:28:37.196+08:00 level=INFO source=download.go:178 msg="ff1d1fc78170 part 25 attempt 0 failed: Get \"https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/ff/ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240720%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240720T232807Z&X-Amz-Expires=1200&X-Amz-SignedHeaders=host&X-Amz-Signature=428592c40b79461663b67f7dfef07ad8b400cc6024fa095eb10494afd744febb\": dial tcp 104.18.8.90:443: i/o timeout, retrying in 1s"
time=2024-07-21T07:31:34.243+08:00 level=INFO source=images.go:1047 msg="request failed: Head \"https://registry.ollama.ai/v2/library/gemma2/blobs/sha256:109037bec39c0becc8221222ae23557559bc594290945a2c4221ab4f303b8871\": dial tcp 172.67.182.229:443: i/o timeout"
[GIN] 2024/07/21 - 07:31:34 | 200 |         3m45s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/07/21 - 07:31:54 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/21 - 07:31:54 | 404 |       503.3µs |       127.0.0.1 | POST     "/api/show"
time=2024-07-21T07:32:11.792+08:00 level=INFO source=download.go:136 msg="downloading 109037bec39c in 1 136 B part(s)"
time=2024-07-21T07:32:15.018+08:00 level=INFO source=download.go:136 msg="downloading 097a36493f71 in 1 8.4 KB part(s)"
time=2024-07-21T07:32:18.187+08:00 level=INFO source=download.go:136 msg="downloading 10aa81da732e in 1 487 B part(s)"
[GIN] 2024/07/21 - 07:32:20 | 200 |   26.1686553s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/07/21 - 07:32:20 | 200 |     80.6173ms |       127.0.0.1 | POST     "/api/show"
time=2024-07-21T07:32:20.522+08:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=43 layers.offload=41 layers.split="" memory.available="[6.7 GiB]" memory.required.full="7.5 GiB" memory.required.partial="6.7 GiB" memory.required.kv="672.0 MiB" memory.required.allocations="[6.7 GiB]" memory.weights.total="5.0 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="717.8 MiB" memory.graph.full="507.0 MiB" memory.graph.partial="1.2 GiB"
time=2024-07-21T07:32:20.534+08:00 level=INFO source=server.go:383 msg="starting llama server" cmd="C:\\Users\\Raven\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model D:\\AGI\\ollama_models\\blobs\\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --no-mmap --parallel 1 --port 53069"
time=2024-07-21T07:32:20.538+08:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-21T07:32:20.538+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-21T07:32:20.540+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3337 commit="a8db2a9c" tid="3440" timestamp=1721518340
INFO [wmain] system info | n_threads=10 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="3440" timestamp=1721518340 total_threads=20
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="53069" tid="3440" timestamp=1721518340
llama_model_loader: loaded meta data with 29 key-value pairs and 464 tensors from D:\AGI\ollama_models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.name str              = gemma-2-9b-it
llama_model_loader: - kv   2:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   3:                    gemma2.embedding_length u32              = 3584
llama_model_loader: - kv   4:                         gemma2.block_count u32              = 42
llama_model_loader: - kv   5:                 gemma2.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                gemma2.attention.head_count u32              = 16
llama_model_loader: - kv   7:             gemma2.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  10:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  13:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  14:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  18:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  20:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  22:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  25:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  27:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  169 tensors
llama_model_loader: - type q4_0:  294 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-07-21T07:32:20.793+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 364
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 42
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 9B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 9.24 B
llm_load_print_meta: model size       = 5.06 GiB (4.71 BPW) 
llm_load_print_meta: general.name     = gemma-2-9b-it
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 93
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3070 Ti Laptop GPU, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.41 MiB
llm_load_tensors: offloading 41 repeating layers to GPU
llm_load_tensors: offloaded 41/43 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =  1541.93 MiB
llm_load_tensors:      CUDA0 buffer size =  4361.05 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    16.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   656.00 MiB
llama_new_context_with_model: KV self size  =  672.00 MiB, K (f16):  336.00 MiB, V (f16):  336.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.99 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1224.77 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    16.01 MiB
llama_new_context_with_model: graph nodes  = 1690
llama_new_context_with_model: graph splits = 17
INFO [wmain] model loaded | tid="3440" timestamp=1721518348
time=2024-07-21T07:32:28.977+08:00 level=INFO source=server.go:617 msg="llama runner started in 8.44 seconds"
[GIN] 2024/07/21 - 07:32:28 | 200 |    8.6336192s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/21 - 07:35:37 | 200 |   53.5176442s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/21 - 07:35:51 | 200 |            0s |       127.0.0.1 | HEAD     "/"
time=2024-07-21T07:35:51.696+08:00 level=WARN source=routes.go:817 msg="bad manifest filepath" name=hub/bacx/studybuddy:latest error="open D:\\AGI\\ollama_models\\blobs\\sha256-c65468c33ec86e462ef2a5eff135cbe40b4e7179b72806048034ccc9dd671eb6: The system cannot find the file specified."
[GIN] 2024/07/21 - 07:35:51 | 200 |      21.012ms |       127.0.0.1 | GET      "/api/tags"
Author
Owner

@rick-github commented on GitHub (Jul 21, 2024):

Not all layers are being loaded onto the GPU in 0.2.7:

llm_load_tensors: offloading 41 repeating layers to GPU
llm_load_tensors: offloaded 41/43 layers to GPU

Do you have corresponding logs for 0.2.0?
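
A quick way to see this split without digging through the server log is `ollama ps` after a request. A minimal sketch; the ID/SIZE/PROCESSOR values below are illustrative and the exact columns can vary by version:

```
# Force a load, then check how the model is split between CPU and GPU
ollama run gemma2 "hello"
ollama ps
# NAME            ID              SIZE      PROCESSOR          UNTIL
# gemma2:latest   a1b2c3d4e5f6    7.5 GB    10%/90% CPU/GPU    4 minutes from now
```

A model that fits entirely on the GPU reports `100% GPU`; any CPU share, like the 41/43 offload above, slows generation noticeably.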

Author
Owner

@rick-github commented on GitHub (Jul 21, 2024):

Not all 8G on your card is available for ollama: memory.available="[6.8 GiB]". What's the output of nvidia-smi?
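
For the record, a few stock `nvidia-smi` invocations that answer this; the flags are standard, output omitted here:

```
# Summary: driver/CUDA version, VRAM usage, and the per-process list
nvidia-smi

# Just the memory numbers, machine-readable
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv

# Which compute processes are holding VRAM (e.g. ollama_llama_server.exe)
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```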

Author
Owner

@AeneasZhu commented on GitHub (Jul 21, 2024):

Not all layers are being loaded onto the GPU in 0.2.7:

llm_load_tensors: offloading 41 repeating layers to GPU
llm_load_tensors: offloaded 41/43 layers to GPU

Do you have corresponding logs for 0.2.0?

2024/07/21 09:23:25 routes.go:1033: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:D:\\AGI\\ollama_models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR:C:\\Users\\Raven\\AppData\\Local\\Programs\\Ollama\\ollama_runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-07-21T09:23:25.512+08:00 level=INFO source=images.go:751 msg="total blobs: 77"
time=2024-07-21T09:23:25.515+08:00 level=INFO source=images.go:758 msg="total unused blobs removed: 0"
time=2024-07-21T09:23:25.518+08:00 level=INFO source=routes.go:1080 msg="Listening on 127.0.0.1:11434 (version 0.2.0)"
time=2024-07-21T09:23:25.518+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11.3 rocm_v5.7]"
time=2024-07-21T09:23:25.518+08:00 level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
time=2024-07-21T09:23:26.766+08:00 level=INFO source=types.go:103 msg="inference compute" id=GPU-59be21cf-1a6f-4733-e579-d85deb64d686 library=cuda compute=8.6 driver=12.1 name="NVIDIA GeForce RTX 3070 Ti Laptop GPU" total="8.0 GiB" available="6.9 GiB"
[GIN] 2024/07/21 - 09:23:39 | 200 |       562.4µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/21 - 09:23:39 | 200 |      52.121ms |       127.0.0.1 | POST     "/api/show"
time=2024-07-21T09:23:40.138+08:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=43 layers.offload=42 layers.split="" memory.available="[7.5 GiB]" memory.required.full="7.5 GiB" memory.required.partial="6.8 GiB" memory.required.kv="672.0 MiB" memory.required.allocations="[6.8 GiB]" memory.weights.total="5.0 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="717.8 MiB" memory.graph.full="507.0 MiB" memory.graph.partial="1.2 GiB"
time=2024-07-21T09:23:40.157+08:00 level=INFO source=server.go:375 msg="starting llama server" cmd="C:\\Users\\Raven\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model D:\\AGI\\ollama_models\\blobs\\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 42 --no-mmap --parallel 1 --port 57976"
time=2024-07-21T09:23:40.225+08:00 level=INFO source=sched.go:477 msg="loaded runners" count=1
time=2024-07-21T09:23:40.225+08:00 level=INFO source=server.go:563 msg="waiting for llama runner to start responding"
time=2024-07-21T09:23:40.225+08:00 level=INFO source=server.go:604 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3337 commit="a8db2a9c" tid="2304" timestamp=1721525021
INFO [wmain] system info | n_threads=10 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="2304" timestamp=1721525021 total_threads=20
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="57976" tid="2304" timestamp=1721525021
llama_model_loader: loaded meta data with 29 key-value pairs and 464 tensors from D:\AGI\ollama_models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.name str              = gemma-2-9b-it
llama_model_loader: - kv   2:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   3:                    gemma2.embedding_length u32              = 3584
llama_model_loader: - kv   4:                         gemma2.block_count u32              = 42
llama_model_loader: - kv   5:                 gemma2.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                gemma2.attention.head_count u32              = 16
llama_model_loader: - kv   7:             gemma2.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  10:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  13:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  14:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
time=2024-07-21T09:23:41.755+08:00 level=INFO source=server.go:604 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  18:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  20:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  22:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  25:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  27:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  169 tensors
llama_model_loader: - type q4_0:  294 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 364
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 42
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 9B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 9.24 B
llm_load_print_meta: model size       = 5.06 GiB (4.71 BPW) 
llm_load_print_meta: general.name     = gemma-2-9b-it
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 93
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3070 Ti Laptop GPU, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.41 MiB
llm_load_tensors: offloading 42 repeating layers to GPU
llm_load_tensors: offloaded 42/43 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =  1435.56 MiB
llm_load_tensors:      CUDA0 buffer size =  4467.42 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   672.00 MiB
llama_new_context_with_model: KV self size  =  672.00 MiB, K (f16):  336.00 MiB, V (f16):  336.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.99 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1224.77 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    15.01 MiB
llama_new_context_with_model: graph nodes  = 1690
llama_new_context_with_model: graph splits = 4
INFO [wmain] model loaded | tid="2304" timestamp=1721525031
time=2024-07-21T09:23:51.769+08:00 level=INFO source=server.go:609 msg="llama runner started in 11.54 seconds"
[GIN] 2024/07/21 - 09:23:51 | 200 |   11.7944928s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/21 - 09:25:32 | 200 |   28.1855238s |       127.0.0.1 | POST     "/api/chat"
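
Side by side: 0.2.0 sees memory.available="[7.5 GiB]" and puts 42/43 layers on the GPU with 4 graph splits, while the 0.2.7 run above stopped at memory.available="[6.7 GiB]", 41/43 layers, and 17 graph splits, which lines up with the slowdown. One way to test whether the layer count alone explains it is to pin the offload with the standard `num_gpu` option; a sketch (43 comes from layers.model=43 in the log, and an out-of-memory failure here would mean the 0.2.7 estimate was protecting you):

```
# Ask the scheduler to offload all 43 layers regardless of its VRAM estimate
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "gemma2",
  "prompt": "hello",
  "options": { "num_gpu": 43 }
}'
```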
Author
Owner

@AeneasZhu commented on GitHub (Jul 21, 2024):

Not all 8G on your card is available for ollama: memory.available="[6.8 GiB]". What's the output of nvidia-smi?

0.2.0 did run very smoothly.

Author
Owner

@AeneasZhu commented on GitHub (Jul 21, 2024):

After uninstalling and reinstalling 0.2.7, it runs gemma2 smoothly as well. I think it's better to uninstall and reinstall ollama than to upgrade in place.

Here is the log:

2024/07/21 09:32:08 routes.go:1096: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:D:\\AGI\\ollama_models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR:C:\\Users\\Raven\\AppData\\Local\\Programs\\Ollama\\ollama_runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-07-21T09:32:08.218+08:00 level=INFO source=images.go:778 msg="total blobs: 77"
time=2024-07-21T09:32:08.221+08:00 level=INFO source=images.go:785 msg="total unused blobs removed: 0"
time=2024-07-21T09:32:08.224+08:00 level=INFO source=routes.go:1143 msg="Listening on 127.0.0.1:11434 (version 0.2.7)"
time=2024-07-21T09:32:08.225+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11.3 rocm_v6.1 cpu cpu_avx]"
time=2024-07-21T09:32:08.225+08:00 level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
time=2024-07-21T09:32:08.556+08:00 level=INFO source=gpu.go:287 msg="detected OS VRAM overhead" id=GPU-59be21cf-1a6f-4733-e579-d85deb64d686 library=cuda compute=8.6 driver=12.1 name="NVIDIA GeForce RTX 3070 Ti Laptop GPU" overhead="641.5 MiB"
time=2024-07-21T09:32:08.557+08:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-59be21cf-1a6f-4733-e579-d85deb64d686 library=cuda compute=8.6 driver=12.1 name="NVIDIA GeForce RTX 3070 Ti Laptop GPU" total="8.0 GiB" available="6.9 GiB"
[GIN] 2024/07/21 - 09:32:15 | 200 |            0s |       127.0.0.1 | GET      "/api/version"
[GIN] 2024/07/21 - 09:32:23 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/21 - 09:32:23 | 200 |     19.0269ms |       127.0.0.1 | POST     "/api/show"
time=2024-07-21T09:32:24.040+08:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=43 layers.offload=42 layers.split="" memory.available="[6.9 GiB]" memory.required.full="7.5 GiB" memory.required.partial="6.8 GiB" memory.required.kv="672.0 MiB" memory.required.allocations="[6.8 GiB]" memory.weights.total="5.0 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="717.8 MiB" memory.graph.full="507.0 MiB" memory.graph.partial="1.2 GiB"
time=2024-07-21T09:32:24.045+08:00 level=INFO source=server.go:383 msg="starting llama server" cmd="C:\\Users\\Raven\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model D:\\AGI\\ollama_models\\blobs\\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 42 --no-mmap --parallel 1 --port 59469"
time=2024-07-21T09:32:24.108+08:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-21T09:32:24.110+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-21T09:32:24.111+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3337 commit="a8db2a9c" tid="3592" timestamp=1721525545
INFO [wmain] system info | n_threads=10 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="3592" timestamp=1721525545 total_threads=20
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="59469" tid="3592" timestamp=1721525545
llama_model_loader: loaded meta data with 29 key-value pairs and 464 tensors from D:\AGI\ollama_models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.name str              = gemma-2-9b-it
llama_model_loader: - kv   2:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   3:                    gemma2.embedding_length u32              = 3584
llama_model_loader: - kv   4:                         gemma2.block_count u32              = 42
llama_model_loader: - kv   5:                 gemma2.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                gemma2.attention.head_count u32              = 16
llama_model_loader: - kv   7:             gemma2.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  10:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  13:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  14:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
time=2024-07-21T09:32:25.884+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  18:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  20:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  22:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  25:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  27:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  169 tensors
llama_model_loader: - type q4_0:  294 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 364
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 42
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 9B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 9.24 B
llm_load_print_meta: model size       = 5.06 GiB (4.71 BPW) 
llm_load_print_meta: general.name     = gemma-2-9b-it
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 93
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3070 Ti Laptop GPU, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.41 MiB
llm_load_tensors: offloading 42 repeating layers to GPU
llm_load_tensors: offloaded 42/43 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =  1435.56 MiB
llm_load_tensors:      CUDA0 buffer size =  4467.42 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   672.00 MiB
llama_new_context_with_model: KV self size  =  672.00 MiB, K (f16):  336.00 MiB, V (f16):  336.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.99 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1224.77 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    15.01 MiB
llama_new_context_with_model: graph nodes  = 1690
llama_new_context_with_model: graph splits = 4
INFO [wmain] model loaded | tid="3592" timestamp=1721525550
time=2024-07-21T09:32:30.170+08:00 level=INFO source=server.go:617 msg="llama runner started in 6.06 seconds"
[GIN] 2024/07/21 - 09:32:30 | 200 |    6.1944242s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/21 - 09:35:09 | 200 |   21.2028767s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/21 - 09:35:19 | 200 |            0s |       127.0.0.1 | HEAD     "/"
time=2024-07-21T09:35:19.577+08:00 level=WARN source=routes.go:817 msg="bad manifest filepath" name=hub/bacx/studybuddy:latest error="open D:\\AGI\\ollama_models\\blobs\\sha256-c65468c33ec86e462ef2a5eff135cbe40b4e7179b72806048034ccc9dd671eb6: The system cannot find the file specified."
[GIN] 2024/07/21 - 09:35:19 | 200 |      4.9597ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2024/07/21 - 09:35:30 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/07/21 - 09:35:30 | 200 |     36.1565ms |       127.0.0.1 | POST     "/api/show"
time=2024-07-21T09:35:30.566+08:00 level=INFO source=sched.go:495 msg="updated VRAM based on existing loaded models" gpu=GPU-59be21cf-1a6f-4733-e579-d85deb64d686 library=cuda total="8.0 GiB" available="241.7 MiB"
time=2024-07-21T09:35:31.526+08:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=43 layers.offload=41 layers.split="" memory.available="[7.0 GiB]" memory.required.full="7.8 GiB" memory.required.partial="7.0 GiB" memory.required.kv="672.0 MiB" memory.required.allocations="[7.0 GiB]" memory.weights.total="5.3 GiB" memory.weights.repeating="4.6 GiB" memory.weights.nonrepeating="717.8 MiB" memory.graph.full="507.0 MiB" memory.graph.partial="1.2 GiB"
time=2024-07-21T09:35:31.529+08:00 level=INFO source=server.go:383 msg="starting llama server" cmd="C:\\Users\\Raven\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model D:\\AGI\\ollama_models\\blobs\\sha256-befd260af00133c21746d65696658a69103b53287fee1a6d544e8f972de05d67 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --no-mmap --parallel 1 --port 60108"
time=2024-07-21T09:35:31.534+08:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-21T09:35:31.534+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-21T09:35:31.534+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3337 commit="a8db2a9c" tid="16724" timestamp=1721525731
INFO [wmain] system info | n_threads=10 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="16724" timestamp=1721525731 total_threads=20
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="60108" tid="16724" timestamp=1721525731
llama_model_loader: loaded meta data with 29 key-value pairs and 464 tensors from D:\AGI\ollama_models\blobs\sha256-befd260af00133c21746d65696658a69103b53287fee1a6d544e8f972de05d67 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.name str              = merged
llama_model_loader: - kv   2:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   3:                    gemma2.embedding_length u32              = 3584
llama_model_loader: - kv   4:                         gemma2.block_count u32              = 42
llama_model_loader: - kv   5:                 gemma2.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                gemma2.attention.head_count u32              = 16
llama_model_loader: - kv   7:             gemma2.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  10:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  13:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  14:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  18:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  20:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  22:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  25:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  27:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  169 tensors
llama_model_loader: - type q4_K:  252 tensors
llama_model_loader: - type q6_K:   43 tensors
time=2024-07-21T09:35:31.787+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 364
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 42
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 9B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 9.24 B
llm_load_print_meta: model size       = 5.36 GiB (4.98 BPW) 
llm_load_print_meta: general.name     = merged
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 93
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3070 Ti Laptop GPU, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.41 MiB
llm_load_tensors: offloading 41 repeating layers to GPU
llm_load_tensors: offloaded 41/43 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =  1556.37 MiB
llm_load_tensors:      CUDA0 buffer size =  4649.80 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    16.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   656.00 MiB
llama_new_context_with_model: KV self size  =  672.00 MiB, K (f16):  336.00 MiB, V (f16):  336.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.99 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1224.77 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    16.01 MiB
llama_new_context_with_model: graph nodes  = 1690
llama_new_context_with_model: graph splits = 17
INFO [wmain] model loaded | tid="16724" timestamp=1721525737
time=2024-07-21T09:35:37.353+08:00 level=INFO source=server.go:617 msg="llama runner started in 5.82 seconds"
[GIN] 2024/07/21 - 09:35:37 | 200 |    6.8512463s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/21 - 09:37:04 | 200 |   32.4115701s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/07/21 - 09:38:30 | 200 |   35.5102982s |       127.0.0.1 | POST     "/api/chat"
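
One detail worth flagging in this log: the second load at 09:35 stops at 41/43 layers. That model is the larger Q4_K_M quant (5.36 GiB vs 5.06 GiB), and the scheduler briefly saw only 241.7 MiB free while the first gemma2 was still resident ("updated VRAM based on existing loaded models"). When switching between two ~5 GiB models on an 8 GiB card, explicitly unloading the first one avoids that squeeze; a sketch using the standard `keep_alive` parameter:

```
# keep_alive=0 unloads the model as soon as this (empty) request completes,
# freeing its VRAM before the next model is loaded
curl http://127.0.0.1:11434/api/generate -d '{"model": "gemma2", "keep_alive": 0}'
```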
llama_model_loader: - kv 0: general.architecture str = gemma2 llama_model_loader: - kv 1: general.name str = merged llama_model_loader: - kv 2: gemma2.context_length u32 = 8192 llama_model_loader: - kv 3: gemma2.embedding_length u32 = 3584 llama_model_loader: - kv 4: gemma2.block_count u32 = 42 llama_model_loader: - kv 5: gemma2.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: gemma2.attention.head_count u32 = 16 llama_model_loader: - kv 7: gemma2.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 9: gemma2.attention.key_length u32 = 256 llama_model_loader: - kv 10: gemma2.attention.value_length u32 = 256 llama_model_loader: - kv 11: general.file_type u32 = 15 llama_model_loader: - kv 12: gemma2.attn_logit_softcapping f32 = 50.000000 llama_model_loader: - kv 13: gemma2.final_logit_softcapping f32 = 30.000000 llama_model_loader: - kv 14: gemma2.attention.sliding_window u32 = 4096 llama_model_loader: - kv 15: tokenizer.ggml.model str = llama llama_model_loader: - kv 16: tokenizer.ggml.pre str = default llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ... llama_model_loader: - kv 18: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ... llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 2 llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 1 llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 3 llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 26: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol... 
llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = false llama_model_loader: - kv 28: general.quantization_version u32 = 2 llama_model_loader: - type f32: 169 tensors llama_model_loader: - type q4_K: 252 tensors llama_model_loader: - type q6_K: 43 tensors time=2024-07-21T09:35:31.787+08:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: special tokens cache size = 364 llm_load_vocab: token to piece cache size = 1.6014 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = gemma2 llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 256000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 3584 llm_load_print_meta: n_layer = 42 llm_load_print_meta: n_head = 16 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 256 llm_load_print_meta: n_swa = 4096 llm_load_print_meta: n_embd_head_k = 256 llm_load_print_meta: n_embd_head_v = 256 llm_load_print_meta: n_gqa = 2 llm_load_print_meta: n_embd_k_gqa = 2048 llm_load_print_meta: n_embd_v_gqa = 2048 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 9B llm_load_print_meta: model ftype = Q4_K - Medium llm_load_print_meta: model params = 9.24 B llm_load_print_meta: model size = 5.36 GiB (4.98 BPW) llm_load_print_meta: general.name = merged llm_load_print_meta: BOS token = 2 '<bos>' llm_load_print_meta: EOS token = 1 '<eos>' llm_load_print_meta: UNK token = 3 '<unk>' llm_load_print_meta: PAD token = 0 '<pad>' llm_load_print_meta: LF token = 227 '<0x0A>' llm_load_print_meta: EOT token = 107 '<end_of_turn>' llm_load_print_meta: max token length = 93 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3070 Ti Laptop GPU, compute capability 8.6, VMM: yes llm_load_tensors: ggml ctx size = 0.41 MiB llm_load_tensors: offloading 41 repeating layers to GPU llm_load_tensors: offloaded 41/43 layers to GPU llm_load_tensors: CUDA_Host buffer size = 1556.37 MiB llm_load_tensors: CUDA0 buffer size = 4649.80 MiB llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA_Host KV buffer size = 16.00 MiB llama_kv_cache_init: CUDA0 KV buffer size = 656.00 MiB llama_new_context_with_model: KV self size = 672.00 MiB, K (f16): 336.00 MiB, V (f16): 336.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB 
llama_new_context_with_model: CUDA0 compute buffer size = 1224.77 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 16.01 MiB llama_new_context_with_model: graph nodes = 1690 llama_new_context_with_model: graph splits = 17 INFO [wmain] model loaded | tid="16724" timestamp=1721525737 time=2024-07-21T09:35:37.353+08:00 level=INFO source=server.go:617 msg="llama runner started in 5.82 seconds" [GIN] 2024/07/21 - 09:35:37 | 200 | 6.8512463s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/07/21 - 09:37:04 | 200 | 32.4115701s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/07/21 - 09:38:30 | 200 | 35.5102982s | 127.0.0.1 | POST "/api/chat" ```
Author
Owner

@dhiltgen commented on GitHub (Jul 22, 2024):

@AeneasZhu from the logs, it looks like the model was almost fully offloaded, fluctuating between 41 and 42 of 43 layers. As others have noted, this could be down to other apps' GPU usage varying over time. Can you clarify what token rates you're seeing in these scenarios?
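
A quick way to get those numbers, for anyone following along: `ollama run <model> --verbose` prints an eval rate after each response, and the final non-streamed `/api/generate` response carries timing fields. A minimal sketch, assuming the documented `eval_count` (generated tokens) and `eval_duration` (nanoseconds) fields:

```python
import json
import urllib.request

# Non-streamed request so the timing stats arrive in a single JSON object.
payload = {"model": "gemma2:9b", "prompt": "Why is the sky blue?", "stream": False}
req = urllib.request.Request(
    "http://127.0.0.1:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    stats = json.loads(resp.read())

# eval_count = tokens generated, eval_duration = generation time in nanoseconds.
print("tokens/s:", stats["eval_count"] / stats["eval_duration"] * 1e9)
```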


@nutspiano commented on GitHub (Jul 24, 2024):

Something seems off with the required-VRAM calculation. I ran into this while trying to find my maximum context length on the new Llama 3.1 models. Ollama seems to overestimate how much VRAM is needed, and then doesn't offload everything to the GPU even though, in hindsight, it could have.

I see the following in the logs: `memory.available="[22.7 GiB]" memory.required.full="23.5 GiB"`. So it thinks it needs 23.5 GiB and proceeds with a partial offload, yet leaves ~6 GiB of VRAM free once loading is done. To reproduce:

  1. Start `ollama serve`.
  2. Check `nvidia-smi`: 823 MiB allocated.
  3. Ask for a generation from Llama 3.1 8B through the API with an 80000-token context length (to provoke the problem; see the sketch below); 31/33 layers get offloaded.
  4. Check `nvidia-smi` again: 18505 MiB allocated.
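
A minimal way to drive step 3, assuming the standard `/api/generate` endpoint and its `num_ctx` option (the model tag below is just an example; use whichever Llama 3.1 8B quant you have pulled):

```python
import json
import urllib.request

# Request a generation with an oversized context window so the scheduler
# has to budget VRAM for an 80000-token KV cache.
payload = {
    "model": "llama3.1:8b-instruct-q8_0",  # example tag
    "prompt": "Hello",
    "stream": False,
    "options": {"num_ctx": 80000},
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["done"])
```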

Logs:

```
PS C:\Users\xxx> nvidia-smi
Wed Jul 24 13:50:14 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 556.12                 Driver Version: 556.12         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090      WDDM  |   00000000:06:00.0  On |                  N/A |
|  0%   46C    P8             48W /  390W |     823MiB /  24576MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1488    C+G   ...S Code Insiders\Code - Insiders.exe      N/A      |
|    0   N/A  N/A      2088      C   ...ta\Local\Programs\Ollama\ollama.exe      N/A      |
|    0   N/A  N/A      4020    C+G   ...__8wekyb3d8bbwe\WindowsTerminal.exe      N/A      |
|    0   N/A  N/A      5588    C+G   ...5n1h2txyewy\ShellExperienceHost.exe      N/A      |
|    0   N/A  N/A      5788    C+G   ...siveControlPanel\SystemSettings.exe      N/A      |
|    0   N/A  N/A      6992    C+G   ...\cef\cef.win7x64\steamwebhelper.exe      N/A      |
|    0   N/A  N/A      7128    C+G   ...ne\Binaries\Win64\EpicWebHelper.exe      N/A      |
|    0   N/A  N/A      9152    C+G   C:\Windows\explorer.exe                     N/A      |
|    0   N/A  N/A      9868    C+G   ...nt.CBS_cw5n1h2txyewy\SearchHost.exe      N/A      |
|    0   N/A  N/A      9892    C+G   ...2txyewy\StartMenuExperienceHost.exe      N/A      |
|    0   N/A  N/A     11876    C+G   ...t.LockApp_cw5n1h2txyewy\LockApp.exe      N/A      |
|    0   N/A  N/A     13008    C+G   ...UI3Apps\PowerToys.AdvancedPaste.exe      N/A      |
|    0   N/A  N/A     13364    C+G   ...ekyb3d8bbwe\PhoneExperienceHost.exe      N/A      |
|    0   N/A  N/A     13908    C+G   ...inaries\Win64\EpicGamesLauncher.exe      N/A      |
|    0   N/A  N/A     15340    C+G   ...CBS_cw5n1h2txyewy\TextInputHost.exe      N/A      |
|    0   N/A  N/A     15556    C+G   ...crosoft\Edge\Application\msedge.exe      N/A      |
|    0   N/A  N/A     17336    C+G   ...\cef\cef.win7x64\steamwebhelper.exe      N/A      |
+-----------------------------------------------------------------------------------------+
```

```
PS C:\Users\xxx> ollama serve
2024/07/24 13:49:56 routes.go:1100: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_HOST:http://x.x.x.x:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\xxx\\.ollama\\models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR:C:\\Users\\xxx\\AppData\\Local\\Programs\\Ollama\\ollama_runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-07-24T13:49:56.111+02:00 level=INFO source=images.go:784 msg="total blobs: 91"
time=2024-07-24T13:49:56.115+02:00 level=INFO source=images.go:791 msg="total unused blobs removed: 0"
time=2024-07-24T13:49:56.117+02:00 level=INFO source=routes.go:1147 msg="Listening on x.x.x.x:11434 (version 0.2.8)"
time=2024-07-24T13:49:56.118+02:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11.3 rocm_v6.1]"
time=2024-07-24T13:49:56.118+02:00 level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
time=2024-07-24T13:49:56.287+02:00 level=INFO source=gpu.go:287 msg="detected OS VRAM overhead" id=GPU-0837b393-3565-99fe-5263-2d20167323c7 library=cuda compute=8.6 driver=12.5 name="NVIDIA GeForce RTX 3090" overhead="196.8 MiB"
time=2024-07-24T13:49:56.288+02:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-0837b393-3565-99fe-5263-2d20167323c7 library=cuda compute=8.6 driver=12.5 name="NVIDIA GeForce RTX 3090" total="24.0 GiB" available="22.8 GiB"
time=2024-07-24T13:50:47.458+02:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=31 layers.split="" memory.available="[22.7 GiB]" memory.required.full="23.5 GiB" memory.required.partial="22.5 GiB" memory.required.kv="9.8 GiB" memory.required.allocations="[22.5 GiB]" memory.weights.total="16.7 GiB" memory.weights.repeating="16.2 GiB" memory.weights.nonrepeating="532.3 MiB" memory.graph.full="5.1 GiB" memory.graph.partial="5.4 GiB"
time=2024-07-24T13:50:47.464+02:00 level=INFO source=server.go:383 msg="starting llama server" cmd="C:\\Users\\xxx\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\xxx\\.ollama\\models\\blobs\\sha256-d36aafdc1d822f932f3fd3ddc18296628764c5e43f153e9c02b29f5c4525cf2a --ctx-size 80000 --batch-size 512 --embedding --log-disable --n-gpu-layers 31 --flash-attn --no-mmap --parallel 1 --port 53871"
time=2024-07-24T13:50:47.466+02:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-24T13:50:47.466+02:00 level=INFO source=server.go:583 msg="waiting for llama runner to start responding"
time=2024-07-24T13:50:47.466+02:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3440 commit="d94c6e0c" tid="2212" timestamp=1721821847
INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="2212" timestamp=1721821847 total_threads=16
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="53871" tid="2212" timestamp=1721821847
llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from C:\Users\xxx\.ollama\models\blobs\sha256-d36aafdc1d822f932f3fd3ddc18296628764c5e43f153e9c02b29f5c4525cf2a (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 7
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
time=2024-07-24T13:50:47.725+02:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 31 repeating layers to GPU
llm_load_tensors: offloaded 31/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =  1285.67 MiB
llm_load_tensors:      CUDA0 buffer size =  6851.97 MiB
llama_new_context_with_model: n_ctx      = 80128
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   313.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  9703.00 MiB
llama_new_context_with_model: KV self size  = 10016.00 MiB, K (f16): 5008.00 MiB, V (f16): 5008.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.50 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   790.81 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   164.51 MiB
llama_new_context_with_model: graph nodes  = 903
llama_new_context_with_model: graph splits = 15
INFO [wmain] model loaded | tid="2212" timestamp=1721821849
time=2024-07-24T13:50:49.797+02:00 level=INFO source=server.go:622 msg="llama runner started in 2.33 seconds"
[GIN] 2024/07/24 - 13:50:50 | 200 |    2.7652477s |   x.x.x.x | POST     "/api/generate"
```

```
PS C:\Users\xxx> nvidia-smi
Wed Jul 24 13:51:02 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 556.12                 Driver Version: 556.12         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090      WDDM  |   00000000:06:00.0  On |                  N/A |
| 56%   47C    P8             49W /  390W |   18505MiB /  24576MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1488    C+G   ...S Code Insiders\Code - Insiders.exe      N/A      |
|    0   N/A  N/A      2088      C   ...ta\Local\Programs\Ollama\ollama.exe      N/A      |
|    0   N/A  N/A      2536      C   ...\cuda_v11.3\ollama_llama_server.exe      N/A      |
|    0   N/A  N/A      4020    C+G   ...__8wekyb3d8bbwe\WindowsTerminal.exe      N/A      |
|    0   N/A  N/A      5588    C+G   ...5n1h2txyewy\ShellExperienceHost.exe      N/A      |
|    0   N/A  N/A      5788    C+G   ...siveControlPanel\SystemSettings.exe      N/A      |
|    0   N/A  N/A      6992    C+G   ...\cef\cef.win7x64\steamwebhelper.exe      N/A      |
|    0   N/A  N/A      7128    C+G   ...ne\Binaries\Win64\EpicWebHelper.exe      N/A      |
|    0   N/A  N/A      9152    C+G   C:\Windows\explorer.exe                     N/A      |
|    0   N/A  N/A      9868    C+G   ...nt.CBS_cw5n1h2txyewy\SearchHost.exe      N/A      |
|    0   N/A  N/A      9892    C+G   ...2txyewy\StartMenuExperienceHost.exe      N/A      |
|    0   N/A  N/A     11876    C+G   ...t.LockApp_cw5n1h2txyewy\LockApp.exe      N/A      |
|    0   N/A  N/A     13008    C+G   ...UI3Apps\PowerToys.AdvancedPaste.exe      N/A      |
|    0   N/A  N/A     13364    C+G   ...ekyb3d8bbwe\PhoneExperienceHost.exe      N/A      |
|    0   N/A  N/A     13908    C+G   ...inaries\Win64\EpicGamesLauncher.exe      N/A      |
|    0   N/A  N/A     15340    C+G   ...CBS_cw5n1h2txyewy\TextInputHost.exe      N/A      |
|    0   N/A  N/A     15556    C+G   ...crosoft\Edge\Application\msedge.exe      N/A      |
|    0   N/A  N/A     17336    C+G   ...\cef\cef.win7x64\steamwebhelper.exe      N/A      |
+-----------------------------------------------------------------------------------------+
```
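
For what it's worth, the buffer sizes in that log can be sanity-checked by hand: the KV cache follows directly from the model geometry (`n_layer = 32`, `n_embd_k_gqa = 1024`, f16 storage) at the context actually allocated (`n_ctx = 80128`), while the CUDA0 allocations the runner reports sum to well under the 22.5 GiB partial estimate. A quick check, using only numbers from the log above:

```python
MiB = 2**20

# K cache: n_layer * n_ctx * n_embd_k_gqa * 2 bytes (f16); V has the same shape.
k_cache = 32 * 80128 * 1024 * 2 / MiB
print(k_cache)        # 5008.0  -> matches "K (f16): 5008.00 MiB"
print(2 * k_cache)    # 10016.0 -> matches "KV self size = 10016.00 MiB"

# Per-layer KV is 10016 / 32 = 313 MiB; 31 layers on GPU gives the 9703 MiB
# CUDA0 KV buffer, and the one remaining layer the 313 MiB CUDA_Host buffer.

# Actual CUDA0 allocations vs. the scheduler's budget:
gpu_mib = 6851.97 + 9703.00 + 790.81  # weights + KV + compute buffer
print(gpu_mib / 1024)  # ~16.9 GiB actually used, vs. 22.5 GiB estimated
```

So the KV sizing itself is exact; the headroom appears to come from the estimator's graph terms (`memory.graph.partial="5.4 GiB"` in the offload line) rather than from the KV math.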


@nutspiano commented on GitHub (Jul 24, 2024):

And btw, I'm not complaining about slow inference because of a few non-offloaded layers; I just think the whole required-VRAM calculation is off. Here's a more aggressive example with a 100000-token context that offloads only 25 of 33 layers, yet still leaves ~7 GiB of VRAM free.

Logs: (server logs stripped of repeating metadata)

```
PS C:\Users\xxx> nvidia-smi
Wed Jul 24 14:24:48 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 556.12                 Driver Version: 556.12         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090      WDDM  |   00000000:06:00.0  On |                  N/A |
|  0%   50C    P8             49W /  390W |     817MiB /  24576MiB |      6%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1488    C+G   ...S Code Insiders\Code - Insiders.exe      N/A      |
|    0   N/A  N/A      4020    C+G   ...__8wekyb3d8bbwe\WindowsTerminal.exe      N/A      |
|    0   N/A  N/A      5588    C+G   ...5n1h2txyewy\ShellExperienceHost.exe      N/A      |
|    0   N/A  N/A      5788    C+G   ...siveControlPanel\SystemSettings.exe      N/A      |
|    0   N/A  N/A      6992    C+G   ...\cef\cef.win7x64\steamwebhelper.exe      N/A      |
|    0   N/A  N/A      7128    C+G   ...ne\Binaries\Win64\EpicWebHelper.exe      N/A      |
|    0   N/A  N/A      9152    C+G   C:\Windows\explorer.exe                     N/A      |
|    0   N/A  N/A      9344    C+G   ...1.0_x64__8wekyb3d8bbwe\Video.UI.exe      N/A      |
|    0   N/A  N/A      9868    C+G   ...nt.CBS_cw5n1h2txyewy\SearchHost.exe      N/A      |
|    0   N/A  N/A      9892    C+G   ...2txyewy\StartMenuExperienceHost.exe      N/A      |
|    0   N/A  N/A     11876    C+G   ...t.LockApp_cw5n1h2txyewy\LockApp.exe      N/A      |
|    0   N/A  N/A     13008    C+G   ...UI3Apps\PowerToys.AdvancedPaste.exe      N/A      |
|    0   N/A  N/A     13364    C+G   ...ekyb3d8bbwe\PhoneExperienceHost.exe      N/A      |
|    0   N/A  N/A     13908    C+G   ...inaries\Win64\EpicGamesLauncher.exe      N/A      |
|    0   N/A  N/A     15340    C+G   ...CBS_cw5n1h2txyewy\TextInputHost.exe      N/A      |
|    0   N/A  N/A     15556    C+G   ...crosoft\Edge\Application\msedge.exe      N/A      |
|    0   N/A  N/A     17336    C+G   ...\cef\cef.win7x64\steamwebhelper.exe      N/A      |
+-----------------------------------------------------------------------------------------+
```

```
2024/07/24 14:24:53 routes.go:1100: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_HOST:http://x.x.x.x:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\xxx\\.ollama\\models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR:C:\\Users\\xxx\\AppData\\Local\\Programs\\Ollama\\ollama_runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-07-24T14:24:53.952+02:00 level=INFO source=images.go:784 msg="total blobs: 91"
time=2024-07-24T14:24:53.957+02:00 level=INFO source=images.go:791 msg="total unused blobs removed: 0"
time=2024-07-24T14:24:53.959+02:00 level=INFO source=routes.go:1147 msg="Listening on x.x.x.x:11434 (version 0.2.8)"
time=2024-07-24T14:24:53.960+02:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [rocm_v6.1 cpu cpu_avx cpu_avx2 cuda_v11.3]"
time=2024-07-24T14:24:53.960+02:00 level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
time=2024-07-24T14:24:54.123+02:00 level=INFO source=gpu.go:287 msg="detected OS VRAM overhead" id=GPU-0837b393-3565-99fe-5263-2d20167323c7 library=cuda compute=8.6 driver=12.5 name="NVIDIA GeForce RTX 3090" overhead="201.8 MiB"
time=2024-07-24T14:24:54.124+02:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-0837b393-3565-99fe-5263-2d20167323c7 library=cuda compute=8.6 driver=12.5 name="NVIDIA GeForce RTX 3090" total="24.0 GiB" available="22.8 GiB"
time=2024-07-24T14:25:04.450+02:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=25 layers.split="" memory.available="[22.7 GiB]" memory.required.full="27.4 GiB" memory.required.partial="22.7 GiB" memory.required.kv="12.2 GiB" memory.required.allocations="[22.7 GiB]" memory.weights.total="19.1 GiB" memory.weights.repeating="18.6 GiB" memory.weights.nonrepeating="532.3 MiB" memory.graph.full="6.3 GiB" memory.graph.partial="6.7 GiB"
time=2024-07-24T14:25:04.456+02:00 level=INFO source=server.go:383 msg="starting llama server" cmd="C:\\Users\\xxx\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\xxx\\.ollama\\models\\blobs\\sha256-d36aafdc1d822f932f3fd3ddc18296628764c5e43f153e9c02b29f5c4525cf2a --ctx-size 100000 --batch-size 512 --embedding --log-disable --n-gpu-layers 25 --flash-attn --no-mmap --parallel 1 --port 61410"
time=2024-07-24T14:25:04.459+02:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-24T14:25:04.459+02:00 level=INFO source=server.go:583 msg="waiting for llama runner to start responding"
time=2024-07-24T14:25:04.459+02:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3440 commit="d94c6e0c" tid="20012" timestamp=1721823904
INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="20012" timestamp=1721823904 total_threads=16
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="61410" tid="20012" timestamp=1721823904
llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from C:\Users\xxx\.ollama\models\blobs\sha256-d36aafdc1d822f932f3fd3ddc18296628764c5e43f153e9c02b29f5c4525cf2a (version GGUF V3 (latest))
[...]
time=2024-07-24T14:25:04.716+02:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server loading model"
[...]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 25 repeating layers to GPU
llm_load_tensors: offloaded 25/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =  2611.86 MiB
llm_load_tensors:      CUDA0 buffer size =  5525.78 MiB
llama_new_context_with_model: n_ctx      = 100096
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =  2737.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  9775.00 MiB
llama_new_context_with_model: KV self size  = 12512.00 MiB, K (f16): 6256.00 MiB, V (f16): 6256.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.50 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   790.81 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   203.51 MiB
llama_new_context_with_model: graph nodes  = 903
llama_new_context_with_model: graph splits = 81
INFO [wmain] model loaded | tid="20012" timestamp=1721823907
time=2024-07-24T14:25:07.548+02:00 level=INFO source=server.go:622 msg="llama runner started in 3.09 seconds"
[GIN] 2024/07/24 - 14:25:09 | 200 |    5.0798891s |   x.x.x.x | POST     "/api/generate"
```

```
PS C:\Users\xxx> nvidia-smi
Wed Jul 24 14:25:18 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 556.12                 Driver Version: 556.12         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090      WDDM  |   00000000:06:00.0  On |                  N/A |
| 53%   49C    P8             50W /  390W |   17263MiB /  24576MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1488    C+G   ...S Code Insiders\Code - Insiders.exe      N/A      |
|    0   N/A  N/A      3496      C   ...ta\Local\Programs\Ollama\ollama.exe      N/A      |
|    0   N/A  N/A      4020    C+G   ...__8wekyb3d8bbwe\WindowsTerminal.exe      N/A      |
|    0   N/A  N/A      5588    C+G   ...5n1h2txyewy\ShellExperienceHost.exe      N/A      |
|    0   N/A  N/A      5788    C+G   ...siveControlPanel\SystemSettings.exe      N/A      |
|    0   N/A  N/A      6992    C+G   ...\cef\cef.win7x64\steamwebhelper.exe      N/A      |
|    0   N/A  N/A      7128    C+G   ...ne\Binaries\Win64\EpicWebHelper.exe      N/A      |
|    0   N/A  N/A      9152    C+G   C:\Windows\explorer.exe                     N/A      |
|    0   N/A  N/A      9344    C+G   ...1.0_x64__8wekyb3d8bbwe\Video.UI.exe      N/A      |
|    0   N/A  N/A      9868    C+G   ...nt.CBS_cw5n1h2txyewy\SearchHost.exe      N/A      |
|    0   N/A  N/A      9892    C+G   ...2txyewy\StartMenuExperienceHost.exe      N/A      |
|    0   N/A  N/A     11876    C+G   ...t.LockApp_cw5n1h2txyewy\LockApp.exe      N/A      |
|    0   N/A  N/A     13008    C+G   ...UI3Apps\PowerToys.AdvancedPaste.exe      N/A      |
|    0   N/A  N/A     13364    C+G   ...ekyb3d8bbwe\PhoneExperienceHost.exe      N/A      |
|    0   N/A  N/A     13908    C+G   ...inaries\Win64\EpicGamesLauncher.exe      N/A      |
|    0   N/A  N/A     15340    C+G   ...CBS_cw5n1h2txyewy\TextInputHost.exe      N/A      |
|    0   N/A  N/A     15556    C+G   ...crosoft\Edge\Application\msedge.exe      N/A      |
|    0   N/A  N/A     17336    C+G   ...\cef\cef.win7x64\steamwebhelper.exe      N/A      |
|    0   N/A  N/A     20100      C   ...\cuda_v11.3\ollama_llama_server.exe      N/A      |
+-----------------------------------------------------------------------------------------+

Author
Owner

@rick-github commented on GitHub (Jul 24, 2024):

Does it change if you turn flash attention off?

My understanding is that ollama calculates how many layers to offload based on the size of the model and the expected memory usage of the context window, but it's llama.cpp that actually does the memory allocation. If the latter is more efficient than ollama expects, there will be unused memory. Flash attention may come into play here, as it's supposed to make more efficient use of KV space.
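To put numbers on that estimate-vs-allocation gap, here is a minimal Go sketch (not ollama's actual code; the figures are rounded from the logs in this thread): the planner budgets for the larger non-flash-attention graph, llama.cpp with --flash-attn then allocates far less, and the difference sits idle.

```
// A minimal sketch (not ollama's actual code) of the estimate-vs-allocate
// gap described above, using rounded numbers from the logs in this thread.
package main

import "fmt"

func main() {
	// Ollama's planner budgets for a non-flash-attention compute graph
	// (memory.graph.partial in the logs, ~6.7 GiB here).
	estimatedGraphGiB := 6.7

	// llama.cpp then allocates what it actually needs; with --flash-attn the
	// CUDA0 compute buffer came out to 790.81 MiB (~0.77 GiB).
	actualGraphGiB := 0.77

	// The difference is VRAM that was budgeted but never allocated, which is
	// why nvidia-smi still shows ~7 GiB free after the model loads.
	fmt.Printf("VRAM reserved but unused: ~%.1f GiB\n",
		estimatedGraphGiB-actualGraphGiB)
}
```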

Author
Owner

@nutspiano commented on GitHub (Jul 24, 2024):

Good catch, that seems to be it. Here is the same 100k-context run without flash attention. The offload is the same 25/33 layers, but now VRAM is packed nearly full after loading the model, and ~7 GiB spills over into system RAM.

Logs:

PS C:\Users\xxx> systeminfo
[...]
Total Physical Memory:     65 439 MB
Available Physical Memory: 53 070 MB
[...]
PS C:\Users\xxx> nvidia-smi
Wed Jul 24 15:10:51 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 556.12                 Driver Version: 556.12         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090      WDDM  |   00000000:06:00.0  On |                  N/A |
|  0%   49C    P8             49W /  390W |     874MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
[...]
2024/07/24 15:10:59 routes.go:1100: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://x.x.x.x:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\xxx\\.ollama\\models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR:C:\\Users\\xxx\\AppData\\Local\\Programs\\Ollama\\ollama_runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-07-24T15:10:59.262+02:00 level=INFO source=images.go:784 msg="total blobs: 91"
time=2024-07-24T15:10:59.266+02:00 level=INFO source=images.go:791 msg="total unused blobs removed: 0"
time=2024-07-24T15:10:59.269+02:00 level=INFO source=routes.go:1147 msg="Listening on x.x.x.x:11434 (version 0.2.8)"
time=2024-07-24T15:10:59.269+02:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cuda_v11.3 rocm_v6.1 cpu cpu_avx cpu_avx2]"
time=2024-07-24T15:10:59.269+02:00 level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
time=2024-07-24T15:10:59.447+02:00 level=INFO source=gpu.go:287 msg="detected OS VRAM overhead" id=GPU-0837b393-3565-99fe-5263-2d20167323c7 library=cuda compute=8.6 driver=12.5 name="NVIDIA GeForce RTX 3090" overhead="145.4 MiB"
time=2024-07-24T15:10:59.448+02:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-0837b393-3565-99fe-5263-2d20167323c7 library=cuda compute=8.6 driver=12.5 name="NVIDIA GeForce RTX 3090" total="24.0 GiB" available="22.8 GiB"
time=2024-07-24T15:11:05.984+02:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=25 layers.split="" memory.available="[22.7 GiB]" memory.required.full="27.4 GiB" memory.required.partial="22.7 GiB" memory.required.kv="12.2 GiB" memory.required.allocations="[22.7 GiB]" memory.weights.total="19.1 GiB" memory.weights.repeating="18.6 GiB" memory.weights.nonrepeating="532.3 MiB" memory.graph.full="6.3 GiB" memory.graph.partial="6.7 GiB"
time=2024-07-24T15:11:05.991+02:00 level=INFO source=server.go:383 msg="starting llama server" cmd="C:\\Users\\xxx\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\xxx\\.ollama\\models\\blobs\\sha256-d36aafdc1d822f932f3fd3ddc18296628764c5e43f153e9c02b29f5c4525cf2a --ctx-size 100000 --batch-size 512 --embedding --log-disable --n-gpu-layers 25 --no-mmap --parallel 1 --port 62677"
time=2024-07-24T15:11:05.993+02:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-24T15:11:05.993+02:00 level=INFO source=server.go:583 msg="waiting for llama runner to start responding"
time=2024-07-24T15:11:05.993+02:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3440 commit="d94c6e0c" tid="20196" timestamp=1721826666
INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="20196" timestamp=1721826666 total_threads=16
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="62677" tid="20196" timestamp=1721826666
llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from C:\Users\xxx\.ollama\models\blobs\sha256-d36aafdc1d822f932f3fd3ddc18296628764c5e43f153e9c02b29f5c4525cf2a (version GGUF V3 (latest))
[...]
time=2024-07-24T15:11:06.252+02:00 level=INFO source=server.go:617 msg="waiting for server to become available" status="llm server loading model"
[...]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 25 repeating layers to GPU
llm_load_tensors: offloaded 25/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =  2611.86 MiB
llm_load_tensors:      CUDA0 buffer size =  5525.78 MiB
llama_new_context_with_model: n_ctx      = 100000
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =  2734.38 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  9765.62 MiB
llama_new_context_with_model: KV self size  = 12500.00 MiB, K (f16): 6250.00 MiB, V (f16): 6250.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.50 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  6868.94 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   203.32 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 81
INFO [wmain] model loaded | tid="20196" timestamp=1721826669
time=2024-07-24T15:11:09.813+02:00 level=INFO source=server.go:622 msg="llama runner started in 3.82 seconds"
[GIN] 2024/07/24 - 15:11:10 | 200 |    4.8330053s |   x.x.x.x | POST     "/api/generate"
PS C:\Users\xxx> systeminfo
[...]
Total Physical Memory:     65 439 MB
Available Physical Memory: 46 157 MB
[...]
PS C:\Users\xxx> nvidia-smi
Wed Jul 24 15:11:23 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 556.12                 Driver Version: 556.12         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090      WDDM  |   00000000:06:00.0  On |                  N/A |
| 53%   49C    P8             49W /  390W |   23670MiB /  24576MiB |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
[...]
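As a sanity check on the figures above: the 12,500 MiB KV size follows directly from the context length and the model's shape. A small Go sketch, assuming 32 repeating layers and a KV width of 1024 f16 values per token per side (e.g. 8 KV heads × head size 128; these shape parameters are inferred from the logs, not read from the model file):

```
// Back-of-the-envelope check of llama_kv_cache_init's 12500.00 MiB figure.
// The layer count and KV width are assumptions inferred from the logs.
package main

import "fmt"

func main() {
	const (
		nCtx    = 100000 // --ctx-size from the server command line
		nLayers = 32     // repeating layers (33 reported, minus the output layer)
		kvWidth = 1024   // f16 values per token, per layer, for K and for V each
		f16     = 2      // bytes per f16 value
	)

	kvBytes := nCtx * nLayers * 2 * kvWidth * f16 // x2 for K and V
	fmt.Printf("KV self size: %.2f MiB\n", float64(kvBytes)/(1024*1024))
	// Prints 12500.00 MiB. With 25 of 32 layers offloaded, 25/32 of it
	// (9765.62 MiB) lands in the CUDA0 KV buffer, matching the log.
}
```

With flash attention the KV cache is essentially the same size (12,512 MiB for the padded n_ctx of 100,096); what shrinks is the compute buffer, from 6,868.94 MiB here to 790.81 MiB in the earlier run.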
Author
Owner

@AeneasZhu commented on GitHub (Jul 24, 2024):

@rick-github @nutspiano It is odd that ollama/llama.cpp sometimes runs gemma2 very slowly and sometimes smoothly; the speed varies from run to run. Should I open an issue on llama.cpp?

Author
Owner

@rick-github commented on GitHub (Jul 24, 2024):

You can try, but they will also probably ask you for the output of nvidia-smi.

Author
Owner

@AeneasZhu commented on GitHub (Jul 24, 2024):

@rick-github

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 531.68                 Driver Version: 531.68       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                      TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3070 T...  WDDM | 00000000:01:00.0 Off |                  N/A |
| N/A   48C    P8               12W /  N/A|   7228MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      9728      C   ...\cuda_v11.3\ollama_llama_server.exe    N/A      |
|    0   N/A  N/A     17336    C+G   ...Brave-Browser\Application\brave.exe    N/A      |
+---------------------------------------------------------------------------------------+

Here is the output, but I can't interpret it myself, since I have little background in computer science.

Author
Owner

@rick-github commented on GitHub (Jul 24, 2024):

It's unfortunate that nvidia-smi on Windows doesn't report per-process GPU memory usage. But we can see here that your Brave browser is taking up GPU memory, and if Brave works like other browsers, it will want more and more resources over time. So you can try completely closing your browser (not just a tab, the whole program) and then restarting it. Also restart ollama, and see if gemma2 runs any more smoothly.

Realistically, 8 GB of VRAM is not really enough for running mid-sized models alongside other GPU-using software. You can try smaller models like qwen2:1.5b or qwen2:7b.

Author
Owner

@nutspiano commented on GitHub (Jul 24, 2024):

@AeneasZhu Reading your logs, I think your slowness simply comes from the model not being fully offloaded to your GPU. Looking at these lines in your (and my) logs, here's one of yours:

time=2024-07-21T09:32:24.040+08:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=43 layers.offload=42 layers.split="" memory.available="[6.9 GiB]" memory.required.full="7.5 GiB" memory.required.partial="6.8 GiB" memory.required.kv="672.0 MiB" memory.required.allocations="[6.8 GiB]" memory.weights.total="5.0 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="717.8 MiB" memory.graph.full="507.0 MiB" memory.graph.partial="1.2 GiB"

Look at the memory.available and memory.required.full. If less is available than what ollama thinks it needs for a full offload, it will not ask llama.cpp for a full offload. This can be seen on the next line, where it asks for a number of layers to be offloaded:

time=2024-07-21T09:32:24.045+08:00 level=INFO source=server.go:383 msg="starting llama server" cmd="C:\\Users\\Raven\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model D:\\AGI\\ollama_models\\blobs\\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 42 --no-mmap --parallel 1 --port 59469"

It only asks for 42 layers to be offloaded to your GPU, with --n-gpu-layers 42. And Gemma has 43 layers (so close!), so part of the model runs on the CPU and things will be slow. I think the problem here is simply a lack of memory.
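For intuition, here is a simplified Go sketch of that decision using the numbers from the log line above; the real logic lives in ollama's memory.go and accounts for more buffer types than this:

```
// Simplified sketch of the partial-offload decision described above. The
// sizes are taken from the "offload to cuda" log line, and the formula is an
// approximation of what ollama's memory.go does, not the actual code.
package main

import "fmt"

func main() {
	availableGiB := 6.9             // memory.available
	nonRepeatingGiB := 717.8 / 1024 // output weights, kept with the last layer
	graphGiB := 1.2                 // memory.graph.partial
	// Repeating weights (4.3 GiB) plus KV cache (672 MiB), spread over the
	// model's 42 repeating layers.
	perLayerGiB := (4.3 + 672.0/1024) / 42

	layers := int((availableGiB - nonRepeatingGiB - graphGiB) / perLayerGiB)
	fmt.Printf("layers to offload: %d of 43\n", layers) // prints 42, as in the log
}
```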

Although there is probably a sticker on your GPU that says "8 GB", in reality some of that is already used by your operating system for various things, like browser windows (I see you are running Brave) and displaying your desktop. So you lose some of those 8 GB just by turning your PC on, which may not be what you want when you're trying to run LLMs, but there it is. In reality you have about 7 GB available, which is what we see in the log line above, where memory.available="[6.9 GiB]".

Try running even smaller quants or, as @rick-github suggested, other models, and see if you can get them to run fast. The new Llama 3.1 is a tempting alternative, at 8B parameters as opposed to Gemma 2's 9B.

The things I described in my posts were (probably) to do with ollama's VRAM estimation when using flash attention. But you're not using flash attention, so that doesn't affect your slowness.

Author
Owner

@nutspiano commented on GitHub (Jul 26, 2024):

@dhiltgen I don't think the original post here describes a bug, but I do believe I (with the help of @rick-github) uncovered something about VRAM allocation when using flash attention. This is just a brief message pointing that out, in case it gets lost in all the log spam. I can move it to a new issue if you want.

Author
Owner

@ColumbusAI commented on GitHub (Mar 25, 2025):

I'm having this issue. I have an RTX 4090 + RTX 3090 and am loading the gemma 27b model quantized to 8bpw. I have more than enough VRAM between these GPUs, yet ollama is putting model weights on my CPU and it's very slow.

2025-03-25T01:32:28.062Z level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="11.8 GiB"

2025-03-24 18:32:28 time=2025-03-25T01:32:28.062Z level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="8.2 GiB"

2025-03-24 18:32:28 time=2025-03-25T01:32:28.062Z level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA1 size="9.0 GiB"
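For scale, the three buffers in that log sum to roughly 29 GiB of weights alone (quick check below); whether that fits across a 4090 and a 3090 depends on how much VRAM ollama also reserves for the KV cache and compute graphs at the requested context size.

```
// Quick sum of the weight buffers in the log above: the model's weights
// alone come to ~29 GiB, before any KV cache or compute buffers on top.
package main

import "fmt"

func main() {
	cpu, cuda0, cuda1 := 11.8, 8.2, 9.0 // GiB, from the three log lines
	fmt.Printf("total model weights: %.1f GiB\n", cpu+cuda0+cuda1)
	// Two 24 GiB cards have room on paper, but ollama also reserves VRAM
	// for the KV cache and compute graphs, so whether the weights fit fully
	// on the GPUs depends on the requested context size.
}
```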

Reference: github-starred/ollama#29387