[GH-ISSUE #6338] ollama slower than llama.cpp #29738

Closed
opened 2026-04-22 08:54:59 -05:00 by GiteaMirror · 11 comments
Owner

Originally created by @phly95 on GitHub (Aug 13, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6338

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

When using llm-benchmark (https://github.com/MinhNgyuen/llm-benchmark) with ollama, I get around 80 t/s with gemma 2 2b. When asking the same questions to llama.cpp in conversation mode, I get 130 t/s. The llama.cpp command I'm running is:
.\llama-cli -m gemma-2-2b-it-Q4_K_M.gguf --threads 16 -ngl 27 --mlock --port 11484 --top_k 40 --repeat_penalty 1.1 --min_p 0.05 --top_p 0.95 --prompt-cache-all -cb -np 4 --batch-size 512 -cnv

Is there a reason that ollama is ~38% slower than llama.cpp here?
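A rough sketch of how the two sides of this comparison can be measured (the model tag `gemma2:2b` and the prompt are assumptions; the llama.cpp flags are the ones quoted above):

```shell
# Ollama prints an "eval rate" in tokens/s when run with --verbose
# (model tag gemma2:2b and the prompt are assumptions, not from the report):
#   ollama run gemma2:2b --verbose "Why is the sky blue?"
#
# llama.cpp side, exactly as reported:
#   .\llama-cli -m gemma-2-2b-it-Q4_K_M.gguf --threads 16 -ngl 27 --mlock \
#     --port 11484 --top_k 40 --repeat_penalty 1.1 --min_p 0.05 --top_p 0.95 \
#     --prompt-cache-all -cb -np 4 --batch-size 512 -cnv
#
# The quoted "~38% slower" is relative to llama.cpp, i.e. (130 - 80) / 130:
echo "130 80" | awk '{ printf "%.0f%%\n", ($1 - $2) / $1 * 100 }'
```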

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.3.5

GiteaMirror added the performance, nvidia, bug, needs more info, windows labels 2026-04-22 08:55:00 -05:00

@rick-github commented on GitHub (Aug 13, 2024):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.


@phly95 commented on GitHub (Aug 13, 2024):

2024/08/13 09:25:42 routes.go:1123: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\Philip\\.ollama\\models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR:C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-08-13T09:25:42.918-04:00 level=INFO source=images.go:782 msg="total blobs: 10"
time=2024-08-13T09:25:42.926-04:00 level=INFO source=images.go:790 msg="total unused blobs removed: 0"
time=2024-08-13T09:25:42.927-04:00 level=INFO source=routes.go:1170 msg="Listening on 127.0.0.1:11434 (version 0.3.5)"
time=2024-08-13T09:25:42.928-04:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11.3 rocm_v6.1]"
time=2024-08-13T09:25:42.928-04:00 level=INFO source=gpu.go:204 msg="looking for compatible GPUs"
time=2024-08-13T09:25:43.218-04:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a library=cuda compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3060 Ti" total="8.0 GiB" available="7.0 GiB"
[GIN] 2024/08/13 - 09:25:53 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/08/13 - 09:25:53 | 200 |     52.6791ms |       127.0.0.1 | POST     "/api/show"
time=2024-08-13T09:25:53.238-04:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a parallel=4 available=6771941376 required="3.3 GiB"
time=2024-08-13T09:25:53.238-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=27 layers.offload=27 layers.split="" memory.available="[6.3 GiB]" memory.required.full="3.3 GiB" memory.required.partial="3.3 GiB" memory.required.kv="832.0 MiB" memory.required.allocations="[3.3 GiB]" memory.weights.total="1.9 GiB" memory.weights.repeating="1.4 GiB" memory.weights.nonrepeating="461.4 MiB" memory.graph.full="504.5 MiB" memory.graph.partial="965.9 MiB"
time=2024-08-13T09:25:53.251-04:00 level=INFO source=server.go:393 msg="starting llama server" cmd="C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\Philip\\.ollama\\models\\blobs\\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 27 --no-mmap --parallel 4 --port 60908"
time=2024-08-13T09:25:53.279-04:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-13T09:25:53.279-04:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
time=2024-08-13T09:25:53.279-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3535 commit="1e6f6554" tid="9372" timestamp=1723555553
INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="9372" timestamp=1723555553 total_threads=16
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="60908" tid="9372" timestamp=1723555553
llama_model_loader: loaded meta data with 34 key-value pairs and 288 tensors from C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 2.0 2b It Transformers
llama_model_loader: - kv   3:                           general.finetune str              = it-transformers
llama_model_loader: - kv   4:                           general.basename str              = gemma-2.0
llama_model_loader: - kv   5:                         general.size_label str              = 2B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   8:                    gemma2.embedding_length u32              = 2304
llama_model_loader: - kv   9:                         gemma2.block_count u32              = 26
llama_model_loader: - kv  10:                 gemma2.feed_forward_length u32              = 9216
llama_model_loader: - kv  11:                gemma2.attention.head_count u32              = 8
llama_model_loader: - kv  12:             gemma2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  13:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  15:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  16:                          general.file_type u32              = 2
llama_model_loader: - kv  17:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  18:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  19:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = default
time=2024-08-13T09:25:53.791-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  23:                      tokenizer.ggml.scores arr[f32,256000]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  27:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  30:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  32:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  105 tensors
llama_model_loader: - type q4_0:  182 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 249
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 2304
llm_load_print_meta: n_layer          = 26
llm_load_print_meta: n_head           = 8
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 9216
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 2B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 2.61 B
llm_load_print_meta: model size       = 1.51 GiB (4.97 BPW) 
llm_load_print_meta: general.name     = Gemma 2.0 2b It Transformers
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.26 MiB
llm_load_tensors: offloading 26 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 27/27 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   461.43 MiB
llm_load_tensors:      CUDA0 buffer size =  1548.29 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   832.00 MiB
llama_new_context_with_model: KV self size  =  832.00 MiB, K (f16):  416.00 MiB, V (f16):  416.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     3.94 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   504.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    36.51 MiB
llama_new_context_with_model: graph nodes  = 1050
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded | tid="9372" timestamp=1723555557
time=2024-08-13T09:25:57.268-04:00 level=INFO source=server.go:632 msg="llama runner started in 3.99 seconds"
[GIN] 2024/08/13 - 09:25:57 | 200 |    4.1007296s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/08/13 - 09:25:58 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/08/13 - 09:26:12 | 200 |    8.8789117s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/08/13 - 09:29:02 | 200 |      5.7099ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2024/08/13 - 09:29:06 | 200 |     3.427132s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/08/13 - 09:29:14 | 200 |    7.9011742s |       127.0.0.1 | POST     "/api/chat"
time=2024-08-13T09:29:14.135-04:00 level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a library=cuda total="8.0 GiB" available="3.1 GiB"
time=2024-08-13T09:29:14.447-04:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Philip\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a parallel=4 available=6820524032 required="6.1 GiB"
time=2024-08-13T09:29:14.447-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[6.4 GiB]" memory.required.full="6.1 GiB" memory.required.partial="6.1 GiB" memory.required.kv="3.0 GiB" memory.required.allocations="[6.1 GiB]" memory.weights.total="4.9 GiB" memory.weights.repeating="4.8 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="512.0 MiB" memory.graph.partial="512.0 MiB"
time=2024-08-13T09:29:14.455-04:00 level=INFO source=server.go:393 msg="starting llama server" cmd="C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\Philip\\.ollama\\models\\blobs\\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --no-mmap --parallel 4 --port 61318"
time=2024-08-13T09:29:14.461-04:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-13T09:29:14.461-04:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
time=2024-08-13T09:29:14.461-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3535 commit="1e6f6554" tid="14936" timestamp=1723555754
INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="14936" timestamp=1723555754 total_threads=16
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="61318" tid="14936" timestamp=1723555754
llama_model_loader: loaded meta data with 27 key-value pairs and 197 tensors from C:\Users\Philip\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 131072
llama_model_loader: - kv   3:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv   4:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   5:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   6:                           phi3.block_count u32              = 32
llama_model_loader: - kv   7:                  phi3.attention.head_count u32              = 32
llama_model_loader: - kv   8:               phi3.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  11:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 2
llama_model_loader: - kv  13:              phi3.rope.scaling.attn_factor f32              = 1.190238
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {% for message in messages %}{% if me...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   67 tensors
llama_model_loader: - type q4_0:  129 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 67
llm_load_vocab: token to piece cache size = 0.1690 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 3.82 B
llm_load_print_meta: model size       = 2.03 GiB (4.55 BPW) 
llm_load_print_meta: general.name     = Phi3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
time=2024-08-13T09:29:14.722-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
llm_load_tensors: ggml ctx size =    0.21 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =    52.84 MiB
llm_load_tensors:      CUDA0 buffer size =  2021.84 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  3072.00 MiB
llama_new_context_with_model: KV self size  = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.54 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   564.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    22.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded | tid="14936" timestamp=1723555757
time=2024-08-13T09:29:17.340-04:00 level=INFO source=server.go:632 msg="llama runner started in 2.88 seconds"
[GIN] 2024/08/13 - 09:29:18 | 200 |     4.660208s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/08/13 - 09:29:24 | 200 |    5.7101586s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/08/13 - 09:29:43 | 200 |      1.3685ms |       127.0.0.1 | GET      "/api/tags"
time=2024-08-13T09:29:44.060-04:00 level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a library=cuda total="8.0 GiB" available="370.5 MiB"
time=2024-08-13T09:29:44.403-04:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a parallel=4 available=6819233792 required="3.3 GiB"
time=2024-08-13T09:29:44.403-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=27 layers.offload=27 layers.split="" memory.available="[6.4 GiB]" memory.required.full="3.3 GiB" memory.required.partial="3.3 GiB" memory.required.kv="832.0 MiB" memory.required.allocations="[3.3 GiB]" memory.weights.total="1.9 GiB" memory.weights.repeating="1.4 GiB" memory.weights.nonrepeating="461.4 MiB" memory.graph.full="504.5 MiB" memory.graph.partial="965.9 MiB"
time=2024-08-13T09:29:44.412-04:00 level=INFO source=server.go:393 msg="starting llama server" cmd="C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\Philip\\.ollama\\models\\blobs\\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 27 --no-mmap --parallel 4 --port 61381"
time=2024-08-13T09:29:44.417-04:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-13T09:29:44.417-04:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
time=2024-08-13T09:29:44.417-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3535 commit="1e6f6554" tid="22872" timestamp=1723555784
INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="22872" timestamp=1723555784 total_threads=16
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="61381" tid="22872" timestamp=1723555784
llama_model_loader: loaded meta data with 34 key-value pairs and 288 tensors from C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 2.0 2b It Transformers
llama_model_loader: - kv   3:                           general.finetune str              = it-transformers
llama_model_loader: - kv   4:                           general.basename str              = gemma-2.0
llama_model_loader: - kv   5:                         general.size_label str              = 2B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   8:                    gemma2.embedding_length u32              = 2304
llama_model_loader: - kv   9:                         gemma2.block_count u32              = 26
llama_model_loader: - kv  10:                 gemma2.feed_forward_length u32              = 9216
llama_model_loader: - kv  11:                gemma2.attention.head_count u32              = 8
llama_model_loader: - kv  12:             gemma2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  13:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  15:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  16:                          general.file_type u32              = 2
llama_model_loader: - kv  17:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  18:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  19:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  23:                      tokenizer.ggml.scores arr[f32,256000]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  27:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  30:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  32:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  105 tensors
llama_model_loader: - type q4_0:  182 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-08-13T09:29:44.673-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 249
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 2304
llm_load_print_meta: n_layer          = 26
llm_load_print_meta: n_head           = 8
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 9216
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 2B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 2.61 B
llm_load_print_meta: model size       = 1.51 GiB (4.97 BPW) 
llm_load_print_meta: general.name     = Gemma 2.0 2b It Transformers
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.26 MiB
llm_load_tensors: offloading 26 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 27/27 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   461.43 MiB
llm_load_tensors:      CUDA0 buffer size =  1548.29 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   832.00 MiB
llama_new_context_with_model: KV self size  =  832.00 MiB, K (f16):  416.00 MiB, V (f16):  416.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     3.94 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   504.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    36.51 MiB
llama_new_context_with_model: graph nodes  = 1050
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded | tid="22872" timestamp=1723555786
time=2024-08-13T09:29:46.550-04:00 level=INFO source=server.go:632 msg="llama runner started in 2.13 seconds"
[GIN] 2024/08/13 - 09:29:49 | 200 |    5.8284308s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/08/13 - 09:29:59 | 200 |     9.386519s |       127.0.0.1 | POST     "/api/chat"
time=2024-08-13T09:29:59.225-04:00 level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a library=cuda total="8.0 GiB" available="3.1 GiB"
time=2024-08-13T09:29:59.522-04:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Philip\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a parallel=4 available=6812811264 required="6.1 GiB"
time=2024-08-13T09:29:59.523-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[6.3 GiB]" memory.required.full="6.1 GiB" memory.required.partial="6.1 GiB" memory.required.kv="3.0 GiB" memory.required.allocations="[6.1 GiB]" memory.weights.total="4.9 GiB" memory.weights.repeating="4.8 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="512.0 MiB" memory.graph.partial="512.0 MiB"
time=2024-08-13T09:29:59.531-04:00 level=INFO source=server.go:393 msg="starting llama server" cmd="C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\Philip\\.ollama\\models\\blobs\\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --no-mmap --parallel 4 --port 61409"
time=2024-08-13T09:29:59.536-04:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-13T09:29:59.537-04:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
time=2024-08-13T09:29:59.537-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3535 commit="1e6f6554" tid="12124" timestamp=1723555799
INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="12124" timestamp=1723555799 total_threads=16
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="61409" tid="12124" timestamp=1723555799
llama_model_loader: loaded meta data with 27 key-value pairs and 197 tensors from C:\Users\Philip\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 131072
llama_model_loader: - kv   3:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv   4:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   5:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   6:                           phi3.block_count u32              = 32
llama_model_loader: - kv   7:                  phi3.attention.head_count u32              = 32
llama_model_loader: - kv   8:               phi3.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  11:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 2
llama_model_loader: - kv  13:              phi3.rope.scaling.attn_factor f32              = 1.190238
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {% for message in messages %}{% if me...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   67 tensors
llama_model_loader: - type q4_0:  129 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 67
llm_load_vocab: token to piece cache size = 0.1690 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 3.82 B
llm_load_print_meta: model size       = 2.03 GiB (4.55 BPW) 
llm_load_print_meta: general.name     = Phi3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
time=2024-08-13T09:29:59.798-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
llm_load_tensors: ggml ctx size =    0.21 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =    52.84 MiB
llm_load_tensors:      CUDA0 buffer size =  2021.84 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  3072.00 MiB
llama_new_context_with_model: KV self size  = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.54 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   564.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    22.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded | tid="12124" timestamp=1723555801
time=2024-08-13T09:30:01.721-04:00 level=INFO source=server.go:632 msg="llama runner started in 2.18 seconds"
[GIN] 2024/08/13 - 09:30:08 | 200 |    9.7045794s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/08/13 - 09:30:19 | 200 |   11.0502895s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/08/13 - 09:31:52 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/08/13 - 09:31:52 | 500 |       535.8µs |       127.0.0.1 | DELETE   "/api/delete"
[GIN] 2024/08/13 - 09:32:00 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/08/13 - 09:32:01 | 200 |     179.651ms |       127.0.0.1 | DELETE   "/api/delete"
[GIN] 2024/08/13 - 09:32:04 | 200 |         783µs |       127.0.0.1 | GET      "/api/tags"
time=2024-08-13T09:32:04.557-04:00 level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a library=cuda total="8.0 GiB" available="472.3 MiB"
time=2024-08-13T09:32:04.891-04:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a parallel=4 available=6926028800 required="3.3 GiB"
time=2024-08-13T09:32:04.891-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=27 layers.offload=27 layers.split="" memory.available="[6.5 GiB]" memory.required.full="3.3 GiB" memory.required.partial="3.3 GiB" memory.required.kv="832.0 MiB" memory.required.allocations="[3.3 GiB]" memory.weights.total="1.9 GiB" memory.weights.repeating="1.4 GiB" memory.weights.nonrepeating="461.4 MiB" memory.graph.full="504.5 MiB" memory.graph.partial="965.9 MiB"
time=2024-08-13T09:32:04.899-04:00 level=INFO source=server.go:393 msg="starting llama server" cmd="C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\Philip\\.ollama\\models\\blobs\\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 27 --no-mmap --parallel 4 --port 61980"
time=2024-08-13T09:32:04.904-04:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-13T09:32:04.904-04:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
time=2024-08-13T09:32:04.905-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3535 commit="1e6f6554" tid="20144" timestamp=1723555924
INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="20144" timestamp=1723555924 total_threads=16
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="61980" tid="20144" timestamp=1723555924
llama_model_loader: loaded meta data with 34 key-value pairs and 288 tensors from C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 2.0 2b It Transformers
llama_model_loader: - kv   3:                           general.finetune str              = it-transformers
llama_model_loader: - kv   4:                           general.basename str              = gemma-2.0
llama_model_loader: - kv   5:                         general.size_label str              = 2B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   8:                    gemma2.embedding_length u32              = 2304
llama_model_loader: - kv   9:                         gemma2.block_count u32              = 26
llama_model_loader: - kv  10:                 gemma2.feed_forward_length u32              = 9216
llama_model_loader: - kv  11:                gemma2.attention.head_count u32              = 8
llama_model_loader: - kv  12:             gemma2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  13:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  15:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  16:                          general.file_type u32              = 2
llama_model_loader: - kv  17:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  18:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  19:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  23:                      tokenizer.ggml.scores arr[f32,256000]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  27:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  30:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  32:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  105 tensors
llama_model_loader: - type q4_0:  182 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-08-13T09:32:05.169-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 249
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 2304
llm_load_print_meta: n_layer          = 26
llm_load_print_meta: n_head           = 8
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 9216
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 2B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 2.61 B
llm_load_print_meta: model size       = 1.51 GiB (4.97 BPW) 
llm_load_print_meta: general.name     = Gemma 2.0 2b It Transformers
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.26 MiB
llm_load_tensors: offloading 26 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 27/27 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   461.43 MiB
llm_load_tensors:      CUDA0 buffer size =  1548.29 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   832.00 MiB
llama_new_context_with_model: KV self size  =  832.00 MiB, K (f16):  416.00 MiB, V (f16):  416.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     3.94 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   504.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    36.51 MiB
llama_new_context_with_model: graph nodes  = 1050
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded | tid="20144" timestamp=1723555926
time=2024-08-13T09:32:07.053-04:00 level=INFO source=server.go:632 msg="llama runner started in 2.15 seconds"
[GIN] 2024/08/13 - 09:32:10 | 200 |    5.9310484s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/08/13 - 09:32:18 | 200 |    8.0926735s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/08/13 - 09:41:44 | 200 |       510.7µs |       127.0.0.1 | GET      "/api/version"
llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 832.00 MiB llama_new_context_with_model: KV self size = 832.00 MiB, K (f16): 416.00 MiB, V (f16): 416.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 3.94 MiB llama_new_context_with_model: CUDA0 compute buffer size = 504.50 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 36.51 MiB llama_new_context_with_model: graph nodes = 1050 llama_new_context_with_model: graph splits = 2 INFO [wmain] model loaded | tid="9372" timestamp=1723555557 time=2024-08-13T09:25:57.268-04:00 level=INFO source=server.go:632 msg="llama runner started in 3.99 seconds" [GIN] 2024/08/13 - 09:25:57 | 200 | 4.1007296s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/08/13 - 09:25:58 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/08/13 - 09:26:12 | 200 | 8.8789117s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/08/13 - 09:29:02 | 200 | 5.7099ms | 127.0.0.1 | GET "/api/tags" [GIN] 2024/08/13 - 09:29:06 | 200 | 3.427132s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/08/13 - 09:29:14 | 200 | 7.9011742s | 127.0.0.1 | POST "/api/chat" time=2024-08-13T09:29:14.135-04:00 level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a library=cuda total="8.0 GiB" available="3.1 GiB" time=2024-08-13T09:29:14.447-04:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Philip\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a parallel=4 available=6820524032 required="6.1 GiB" time=2024-08-13T09:29:14.447-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[6.4 GiB]" memory.required.full="6.1 GiB" memory.required.partial="6.1 GiB" memory.required.kv="3.0 GiB" memory.required.allocations="[6.1 GiB]" 
memory.weights.total="4.9 GiB" memory.weights.repeating="4.8 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="512.0 MiB" memory.graph.partial="512.0 MiB" time=2024-08-13T09:29:14.455-04:00 level=INFO source=server.go:393 msg="starting llama server" cmd="C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\Philip\\.ollama\\models\\blobs\\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --no-mmap --parallel 4 --port 61318" time=2024-08-13T09:29:14.461-04:00 level=INFO source=sched.go:445 msg="loaded runners" count=1 time=2024-08-13T09:29:14.461-04:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding" time=2024-08-13T09:29:14.461-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error" INFO [wmain] build info | build=3535 commit="1e6f6554" tid="14936" timestamp=1723555754 INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="14936" timestamp=1723555754 total_threads=16 INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="61318" tid="14936" timestamp=1723555754 llama_model_loader: loaded meta data with 27 key-value pairs and 197 tensors from C:\Users\Philip\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = phi3 llama_model_loader: - kv 1: general.name str = Phi3 llama_model_loader: - kv 2: phi3.context_length u32 = 131072 llama_model_loader: - kv 3: phi3.rope.scaling.original_context_length u32 = 4096 llama_model_loader: - kv 4: phi3.embedding_length u32 = 3072 llama_model_loader: - kv 5: phi3.feed_forward_length u32 = 8192 llama_model_loader: - kv 6: phi3.block_count u32 = 32 llama_model_loader: - kv 7: phi3.attention.head_count u32 = 32 llama_model_loader: - kv 8: phi3.attention.head_count_kv u32 = 32 llama_model_loader: - kv 9: phi3.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: phi3.rope.dimension_count u32 = 96 llama_model_loader: - kv 11: phi3.rope.freq_base f32 = 10000.000000 llama_model_loader: - kv 12: general.file_type u32 = 2 llama_model_loader: - kv 13: phi3.rope.scaling.attn_factor f32 = 1.190238 llama_model_loader: - kv 14: tokenizer.ggml.model str = llama llama_model_loader: - kv 15: tokenizer.ggml.pre str = default llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,32064] = ["<unk>", "<s>", "</s>", "<0x00>", "<... llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,32064] = [-1000.000000, -1000.000000, -1000.00... llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,32064] = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 32000 llama_model_loader: - kv 21: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 32000 llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 24: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 25: tokenizer.chat_template str = {% for message in messages %}{% if me... 
llama_model_loader: - kv 26: general.quantization_version u32 = 2 llama_model_loader: - type f32: 67 tensors llama_model_loader: - type q4_0: 129 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens cache size = 67 llm_load_vocab: token to piece cache size = 0.1690 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = phi3 llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32064 llm_load_print_meta: n_merges = 0 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 3072 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 32 llm_load_print_meta: n_rot = 96 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 96 llm_load_print_meta: n_embd_head_v = 96 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 3072 llm_load_print_meta: n_embd_v_gqa = 3072 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 8192 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 4096 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 3B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 3.82 B llm_load_print_meta: model size = 2.03 GiB (4.55 BPW) llm_load_print_meta: general.name = Phi3 
llm_load_print_meta: BOS token = 1 '<s>' llm_load_print_meta: EOS token = 32000 '<|endoftext|>' llm_load_print_meta: UNK token = 0 '<unk>' llm_load_print_meta: PAD token = 32000 '<|endoftext|>' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_print_meta: EOT token = 32007 '<|end|>' llm_load_print_meta: max token length = 48 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes time=2024-08-13T09:29:14.722-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model" llm_load_tensors: ggml ctx size = 0.21 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CUDA_Host buffer size = 52.84 MiB llm_load_tensors: CUDA0 buffer size = 2021.84 MiB llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 3072.00 MiB llama_new_context_with_model: KV self size = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.54 MiB llama_new_context_with_model: CUDA0 compute buffer size = 564.00 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 22.01 MiB llama_new_context_with_model: graph nodes = 1286 llama_new_context_with_model: graph splits = 2 INFO [wmain] model loaded | tid="14936" timestamp=1723555757 time=2024-08-13T09:29:17.340-04:00 level=INFO source=server.go:632 msg="llama runner started in 2.88 seconds" [GIN] 2024/08/13 - 09:29:18 | 200 | 4.660208s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/08/13 - 09:29:24 | 200 | 
5.7101586s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/08/13 - 09:29:43 | 200 | 1.3685ms | 127.0.0.1 | GET "/api/tags" time=2024-08-13T09:29:44.060-04:00 level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a library=cuda total="8.0 GiB" available="370.5 MiB" time=2024-08-13T09:29:44.403-04:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a parallel=4 available=6819233792 required="3.3 GiB" time=2024-08-13T09:29:44.403-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=27 layers.offload=27 layers.split="" memory.available="[6.4 GiB]" memory.required.full="3.3 GiB" memory.required.partial="3.3 GiB" memory.required.kv="832.0 MiB" memory.required.allocations="[3.3 GiB]" memory.weights.total="1.9 GiB" memory.weights.repeating="1.4 GiB" memory.weights.nonrepeating="461.4 MiB" memory.graph.full="504.5 MiB" memory.graph.partial="965.9 MiB" time=2024-08-13T09:29:44.412-04:00 level=INFO source=server.go:393 msg="starting llama server" cmd="C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\Philip\\.ollama\\models\\blobs\\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 27 --no-mmap --parallel 4 --port 61381" time=2024-08-13T09:29:44.417-04:00 level=INFO source=sched.go:445 msg="loaded runners" count=1 time=2024-08-13T09:29:44.417-04:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding" time=2024-08-13T09:29:44.417-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error" INFO [wmain] build info | build=3535 
commit="1e6f6554" tid="22872" timestamp=1723555784 INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="22872" timestamp=1723555784 total_threads=16 INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="61381" tid="22872" timestamp=1723555784 llama_model_loader: loaded meta data with 34 key-value pairs and 288 tensors from C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = gemma2 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Gemma 2.0 2b It Transformers llama_model_loader: - kv 3: general.finetune str = it-transformers llama_model_loader: - kv 4: general.basename str = gemma-2.0 llama_model_loader: - kv 5: general.size_label str = 2B llama_model_loader: - kv 6: general.license str = gemma llama_model_loader: - kv 7: gemma2.context_length u32 = 8192 llama_model_loader: - kv 8: gemma2.embedding_length u32 = 2304 llama_model_loader: - kv 9: gemma2.block_count u32 = 26 llama_model_loader: - kv 10: gemma2.feed_forward_length u32 = 9216 llama_model_loader: - kv 11: gemma2.attention.head_count u32 = 8 llama_model_loader: - kv 12: gemma2.attention.head_count_kv u32 = 4 llama_model_loader: - kv 13: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 14: gemma2.attention.key_length u32 = 256 llama_model_loader: - kv 15: gemma2.attention.value_length u32 = 256 llama_model_loader: - kv 16: general.file_type u32 = 2 llama_model_loader: - kv 17: 
gemma2.attn_logit_softcapping f32 = 50.000000 llama_model_loader: - kv 18: gemma2.final_logit_softcapping f32 = 30.000000 llama_model_loader: - kv 19: gemma2.attention.sliding_window u32 = 4096 llama_model_loader: - kv 20: tokenizer.ggml.model str = llama llama_model_loader: - kv 21: tokenizer.ggml.pre str = default llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ... llama_model_loader: - kv 23: tokenizer.ggml.scores arr[f32,256000] = [-1000.000000, -1000.000000, -1000.00... llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ... llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 2 llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 1 llama_model_loader: - kv 27: tokenizer.ggml.unknown_token_id u32 = 3 llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 30: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 31: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol... 
llama_model_loader: - kv 32: tokenizer.ggml.add_space_prefix bool = false llama_model_loader: - kv 33: general.quantization_version u32 = 2 llama_model_loader: - type f32: 105 tensors llama_model_loader: - type q4_0: 182 tensors llama_model_loader: - type q6_K: 1 tensors time=2024-08-13T09:29:44.673-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: special tokens cache size = 249 llm_load_vocab: token to piece cache size = 1.6014 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = gemma2 llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 256000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 2304 llm_load_print_meta: n_layer = 26 llm_load_print_meta: n_head = 8 llm_load_print_meta: n_head_kv = 4 llm_load_print_meta: n_rot = 256 llm_load_print_meta: n_swa = 4096 llm_load_print_meta: n_embd_head_k = 256 llm_load_print_meta: n_embd_head_v = 256 llm_load_print_meta: n_gqa = 2 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 9216 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 
llm_load_print_meta: model type = 2B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 2.61 B llm_load_print_meta: model size = 1.51 GiB (4.97 BPW) llm_load_print_meta: general.name = Gemma 2.0 2b It Transformers llm_load_print_meta: BOS token = 2 '<bos>' llm_load_print_meta: EOS token = 1 '<eos>' llm_load_print_meta: UNK token = 3 '<unk>' llm_load_print_meta: PAD token = 0 '<pad>' llm_load_print_meta: LF token = 227 '<0x0A>' llm_load_print_meta: EOT token = 107 '<end_of_turn>' llm_load_print_meta: max token length = 48 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes llm_load_tensors: ggml ctx size = 0.26 MiB llm_load_tensors: offloading 26 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 27/27 layers to GPU llm_load_tensors: CUDA_Host buffer size = 461.43 MiB llm_load_tensors: CUDA0 buffer size = 1548.29 MiB llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 832.00 MiB llama_new_context_with_model: KV self size = 832.00 MiB, K (f16): 416.00 MiB, V (f16): 416.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 3.94 MiB llama_new_context_with_model: CUDA0 compute buffer size = 504.50 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 36.51 MiB llama_new_context_with_model: graph nodes = 1050 llama_new_context_with_model: graph splits = 2 INFO [wmain] model loaded | tid="22872" timestamp=1723555786 time=2024-08-13T09:29:46.550-04:00 level=INFO source=server.go:632 msg="llama runner started in 2.13 seconds" [GIN] 2024/08/13 - 09:29:49 | 200 | 
5.8284308s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/08/13 - 09:29:59 | 200 | 9.386519s | 127.0.0.1 | POST "/api/chat" time=2024-08-13T09:29:59.225-04:00 level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a library=cuda total="8.0 GiB" available="3.1 GiB" time=2024-08-13T09:29:59.522-04:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Philip\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a parallel=4 available=6812811264 required="6.1 GiB" time=2024-08-13T09:29:59.523-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[6.3 GiB]" memory.required.full="6.1 GiB" memory.required.partial="6.1 GiB" memory.required.kv="3.0 GiB" memory.required.allocations="[6.1 GiB]" memory.weights.total="4.9 GiB" memory.weights.repeating="4.8 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="512.0 MiB" memory.graph.partial="512.0 MiB" time=2024-08-13T09:29:59.531-04:00 level=INFO source=server.go:393 msg="starting llama server" cmd="C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\Philip\\.ollama\\models\\blobs\\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --no-mmap --parallel 4 --port 61409" time=2024-08-13T09:29:59.536-04:00 level=INFO source=sched.go:445 msg="loaded runners" count=1 time=2024-08-13T09:29:59.537-04:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding" time=2024-08-13T09:29:59.537-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error" INFO [wmain] build info | build=3535 
commit="1e6f6554" tid="12124" timestamp=1723555799 INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="12124" timestamp=1723555799 total_threads=16 INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="61409" tid="12124" timestamp=1723555799 llama_model_loader: loaded meta data with 27 key-value pairs and 197 tensors from C:\Users\Philip\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = phi3 llama_model_loader: - kv 1: general.name str = Phi3 llama_model_loader: - kv 2: phi3.context_length u32 = 131072 llama_model_loader: - kv 3: phi3.rope.scaling.original_context_length u32 = 4096 llama_model_loader: - kv 4: phi3.embedding_length u32 = 3072 llama_model_loader: - kv 5: phi3.feed_forward_length u32 = 8192 llama_model_loader: - kv 6: phi3.block_count u32 = 32 llama_model_loader: - kv 7: phi3.attention.head_count u32 = 32 llama_model_loader: - kv 8: phi3.attention.head_count_kv u32 = 32 llama_model_loader: - kv 9: phi3.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: phi3.rope.dimension_count u32 = 96 llama_model_loader: - kv 11: phi3.rope.freq_base f32 = 10000.000000 llama_model_loader: - kv 12: general.file_type u32 = 2 llama_model_loader: - kv 13: phi3.rope.scaling.attn_factor f32 = 1.190238 llama_model_loader: - kv 14: tokenizer.ggml.model str = llama llama_model_loader: - kv 15: tokenizer.ggml.pre str = default llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,32064] = ["<unk>", "<s>", 
"</s>", "<0x00>", "<... llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,32064] = [-1000.000000, -1000.000000, -1000.00... llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,32064] = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 32000 llama_model_loader: - kv 21: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 32000 llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 24: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 25: tokenizer.chat_template str = {% for message in messages %}{% if me... llama_model_loader: - kv 26: general.quantization_version u32 = 2 llama_model_loader: - type f32: 67 tensors llama_model_loader: - type q4_0: 129 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens cache size = 67 llm_load_vocab: token to piece cache size = 0.1690 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = phi3 llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32064 llm_load_print_meta: n_merges = 0 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 3072 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 32 llm_load_print_meta: n_rot = 96 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 96 llm_load_print_meta: n_embd_head_v = 96 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 3072 llm_load_print_meta: n_embd_v_gqa = 3072 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 8192 
llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 4096 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 3B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 3.82 B llm_load_print_meta: model size = 2.03 GiB (4.55 BPW) llm_load_print_meta: general.name = Phi3 llm_load_print_meta: BOS token = 1 '<s>' llm_load_print_meta: EOS token = 32000 '<|endoftext|>' llm_load_print_meta: UNK token = 0 '<unk>' llm_load_print_meta: PAD token = 32000 '<|endoftext|>' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_print_meta: EOT token = 32007 '<|end|>' llm_load_print_meta: max token length = 48 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes time=2024-08-13T09:29:59.798-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model" llm_load_tensors: ggml ctx size = 0.21 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CUDA_Host buffer size = 52.84 MiB llm_load_tensors: CUDA0 buffer size = 2021.84 MiB llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: 
freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 3072.00 MiB
llama_new_context_with_model: KV self size = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.54 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 564.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 22.01 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded | tid="12124" timestamp=1723555801
time=2024-08-13T09:30:01.721-04:00 level=INFO source=server.go:632 msg="llama runner started in 2.18 seconds"
[GIN] 2024/08/13 - 09:30:08 | 200 | 9.7045794s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/08/13 - 09:30:19 | 200 | 11.0502895s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/08/13 - 09:31:52 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/08/13 - 09:31:52 | 500 | 535.8µs | 127.0.0.1 | DELETE "/api/delete"
[GIN] 2024/08/13 - 09:32:00 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/08/13 - 09:32:01 | 200 | 179.651ms | 127.0.0.1 | DELETE "/api/delete"
[GIN] 2024/08/13 - 09:32:04 | 200 | 783µs | 127.0.0.1 | GET "/api/tags"
time=2024-08-13T09:32:04.557-04:00 level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a library=cuda total="8.0 GiB" available="472.3 MiB"
time=2024-08-13T09:32:04.891-04:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a parallel=4 available=6926028800 required="3.3 GiB"
time=2024-08-13T09:32:04.891-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=27 layers.offload=27 layers.split="" memory.available="[6.5 GiB]" memory.required.full="3.3 GiB" memory.required.partial="3.3 GiB" memory.required.kv="832.0 MiB" memory.required.allocations="[3.3 GiB]" memory.weights.total="1.9 GiB" memory.weights.repeating="1.4 GiB" memory.weights.nonrepeating="461.4 MiB" memory.graph.full="504.5 MiB" memory.graph.partial="965.9 MiB"
time=2024-08-13T09:32:04.899-04:00 level=INFO source=server.go:393 msg="starting llama server" cmd="C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\Philip\\.ollama\\models\\blobs\\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 27 --no-mmap --parallel 4 --port 61980"
time=2024-08-13T09:32:04.904-04:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-13T09:32:04.904-04:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
time=2024-08-13T09:32:04.905-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3535 commit="1e6f6554" tid="20144" timestamp=1723555924
INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="20144" timestamp=1723555924 total_threads=16
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="61980" tid="20144" timestamp=1723555924
llama_model_loader: loaded meta data with 34 key-value pairs and 288 tensors from C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture str = gemma2
llama_model_loader: - kv   1: general.type str = model
llama_model_loader: - kv   2: general.name str = Gemma 2.0 2b It Transformers
llama_model_loader: - kv   3: general.finetune str = it-transformers
llama_model_loader: - kv   4: general.basename str = gemma-2.0
llama_model_loader: - kv   5: general.size_label str = 2B
llama_model_loader: - kv   6: general.license str = gemma
llama_model_loader: - kv   7: gemma2.context_length u32 = 8192
llama_model_loader: - kv   8: gemma2.embedding_length u32 = 2304
llama_model_loader: - kv   9: gemma2.block_count u32 = 26
llama_model_loader: - kv  10: gemma2.feed_forward_length u32 = 9216
llama_model_loader: - kv  11: gemma2.attention.head_count u32 = 8
llama_model_loader: - kv  12: gemma2.attention.head_count_kv u32 = 4
llama_model_loader: - kv  13: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv  14: gemma2.attention.key_length u32 = 256
llama_model_loader: - kv  15: gemma2.attention.value_length u32 = 256
llama_model_loader: - kv  16: general.file_type u32 = 2
llama_model_loader: - kv  17: gemma2.attn_logit_softcapping f32 = 50.000000
llama_model_loader: - kv  18: gemma2.final_logit_softcapping f32 = 30.000000
llama_model_loader: - kv  19: gemma2.attention.sliding_window u32 = 4096
llama_model_loader: - kv  20: tokenizer.ggml.model str = llama
llama_model_loader: - kv  21: tokenizer.ggml.pre str = default
llama_model_loader: - kv  22: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  23: tokenizer.ggml.scores arr[f32,256000] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  24: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  25: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv  26: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv  27: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv  28: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv  29: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv  30: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv  31: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  32: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv  33: general.quantization_version u32 = 2
llama_model_loader: - type  f32: 105 tensors
llama_model_loader: - type q4_0: 182 tensors
llama_model_loader: - type q6_K: 1 tensors
time=2024-08-13T09:32:05.169-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 249
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = gemma2
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 256000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 2304
llm_load_print_meta: n_layer = 26
llm_load_print_meta: n_head = 8
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_rot = 256
llm_load_print_meta: n_swa = 4096
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 2
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 9216
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 2B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 2.61 B
llm_load_print_meta: model size = 1.51 GiB (4.97 BPW)
llm_load_print_meta: general.name = Gemma 2.0 2b It Transformers
llm_load_print_meta: BOS token = 2 '<bos>'
llm_load_print_meta: EOS token = 1 '<eos>'
llm_load_print_meta: UNK token = 3 '<unk>'
llm_load_print_meta: PAD token = 0 '<pad>'
llm_load_print_meta: LF token = 227 '<0x0A>'
llm_load_print_meta: EOT token = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.26 MiB
llm_load_tensors: offloading 26 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 27/27 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 461.43 MiB
llm_load_tensors: CUDA0 buffer size = 1548.29 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 832.00 MiB
llama_new_context_with_model: KV self size = 832.00 MiB, K (f16): 416.00 MiB, V (f16): 416.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 3.94 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 504.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 36.51 MiB
llama_new_context_with_model: graph nodes = 1050
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded | tid="20144" timestamp=1723555926
time=2024-08-13T09:32:07.053-04:00 level=INFO source=server.go:632 msg="llama runner started in 2.15 seconds"
[GIN] 2024/08/13 - 09:32:10 | 200 | 5.9310484s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/08/13 - 09:32:18 | 200 | 8.0926735s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/08/13 - 09:41:44 | 200 | 510.7µs | 127.0.0.1 | GET "/api/version"
```

@phly95 commented on GitHub (Aug 13, 2024):

In case it helps, here's the llama.cpp output:

\llama-b3542-bin-win-cuda-cu12.2.0-x64> .\llama-cli -m gemma-2-2b-it-Q4_K_M.gguf --threads 16 -ngl 27 --mlock --port 11484 --top_k 40 --repeat_penalty 1.1 --min_p 0.05 --top_p 0.95 --prompt-cache-all -cb -np 4 --batch-size 512 -cnv
Log start
main: build = 3542 (15fa07a5)
main: built with MSVC 19.29.30154.0 for x64
main: seed  = 1723556264
llama_model_loader: loaded meta data with 39 key-value pairs and 288 tensors from gemma-2-2b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 2 2b It
llama_model_loader: - kv   3:                           general.finetune str              = it
llama_model_loader: - kv   4:                           general.basename str              = gemma-2
llama_model_loader: - kv   5:                         general.size_label str              = 2B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                               general.tags arr[str,2]       = ["conversational", "text-generation"]
llama_model_loader: - kv   8:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   9:                    gemma2.embedding_length u32              = 2304
llama_model_loader: - kv  10:                         gemma2.block_count u32              = 26
llama_model_loader: - kv  11:                 gemma2.feed_forward_length u32              = 9216
llama_model_loader: - kv  12:                gemma2.attention.head_count u32              = 8
llama_model_loader: - kv  13:             gemma2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  14:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  16:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  17:                          general.file_type u32              = 15
llama_model_loader: - kv  18:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  19:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  20:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  24:                      tokenizer.ggml.scores arr[f32,256000]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  28:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  30:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  31:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  33:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - kv  35:                      quantize.imatrix.file str              = /models_out/gemma-2-2b-it-GGUF/gemma-...
llama_model_loader: - kv  36:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  37:             quantize.imatrix.entries_count i32              = 182
llama_model_loader: - kv  38:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  105 tensors
llama_model_loader: - type q4_K:  156 tensors
llama_model_loader: - type q6_K:   27 tensors
llm_load_vocab: special tokens cache size = 249
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 2304
llm_load_print_meta: n_layer          = 26
llm_load_print_meta: n_head           = 8
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 9216
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 2B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 2.61 B
llm_load_print_meta: model size       = 1.59 GiB (5.21 BPW)
llm_load_print_meta: general.name     = Gemma 2 2b It
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.26 MiB
llm_load_tensors: offloading 26 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 27/27 layers to GPU
llm_load_tensors:        CPU buffer size =   461.43 MiB
llm_load_tensors:      CUDA0 buffer size =  1623.70 MiB
..........................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   832.00 MiB
llama_new_context_with_model: KV self size  =  832.00 MiB, K (f16):  416.00 MiB, V (f16):  416.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     3.91 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   504.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    36.51 MiB
llama_new_context_with_model: graph nodes  = 1050
llama_new_context_with_model: graph splits = 2
main: chat template example: <start_of_turn>user
You are a helpful assistant

Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model


system_info: n_threads = 16 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8192, n_batch = 512, n_predict = -1, n_keep = 1


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.


> What is the sky blue?
The sky appears blue due to a phenomenon called **Rayleigh scattering**. Here's a breakdown:

1. **Sunlight and its Colors:** Sunlight contains all colors of the rainbow, each with its own wavelength (like visible light).
2. **Earth's Atmosphere:** Our atmosphere is composed mostly of nitrogen and oxygen molecules.
3. **Scattering:** When sunlight enters the atmosphere, it interacts with these tiny molecules. The shorter wavelengths of light (blue and violet) are scattered more strongly than longer wavelengths like red and orange.
4. **Human Perception:**  Our eyes are most sensitive to blue light, meaning we perceive this scattered light as the dominant color of the sky.

**Why not other colors?**

* **Violet:** While violet light is scattered even more intensely than blue, our eyes are less sensitive to it, so we don't see it as prominently in the daytime sky.
* **Red and Orange:** These longer wavelengths are scattered less, which is why we see them as dominant during sunrise and sunset.

**In summary:** The blue sky is a result of sunlight being scattered by our atmosphere's molecules, making blue light dominate the color we perceive.


Let me know if you have any further questions!


> Write a report on the financials of Nvidia
## Nvidia Financial Snapshot: A Deep Dive

This report provides an overview of Nvidia's financial performance, analyzing key financial metrics and identifying key trends.

**Q1 & Q2 2023 Performance:**

* **Revenue**: Strong revenue growth continued in both Q1 and Q2 2023, driven by robust demand for data centers and AI solutions.
    * Q1 2023: $7.68 billion (up 14% year-over-year)
    * Q2 2023: $8.85 billion (up 29% year-over-year)
* **Net Income**:  Nvidia's net income saw a significant increase in both quarters, reflecting the company's strong performance and efficient cost management.
    * Q1 2023: $1.94 billion (up 68% year-over-year)
    * Q2 2023: $2.17 billion (up 64% year-over-year)
* **Earnings per Share**:  EPS also saw significant growth, reflecting the company's profitability and strong financial position.
    * Q1 2023: $0.85 per share
    * Q2 2023: $1.16 per share

**Drivers of Financial Success:**

* **Data Center Market:** Nvidia's data center business has been a key driver of revenue growth, fueled by demand for its GPUs (Graphics Processing Units) used in AI training and cloud computing.
* **Gaming Segment**:  While facing headwinds from increased competition, the gaming segment remains a significant contributor to Nvidia's revenue, benefiting from strong demand for high-performance graphics cards.
* **Automotive Sector:** The company's automotive segment has been experiencing rapid growth, driven by its technology enabling autonomous driving features and connected vehicles.

**Challenges & Risks:**

* **Geopolitical Tensions**:  The ongoing geopolitical tensions create uncertainty in the global economy, potentially impacting demand for Nvidia's products in various sectors.
* **Competition**:  Competition within the GPU market is intensifying as rival companies like AMD and Intel aggressively enter this space.
* **Macroeconomic Factors**: Economic slowdown and rising inflation pose challenges to overall demand across industries, including Nvidia's key markets.

**Future Outlook:**

* **Continued Growth in Data Centers & AI:** Nvidia expects sustained growth in data center and AI segments as companies invest heavily in cloud computing and artificial intelligence development.
* **Expansion into Automotive and Other Emerging Sectors:**  Nvidia is actively pursuing expansion opportunities in automotive, gaming, and other emerging markets to diversify its revenue streams.


**Key Financial Ratios:**

* **Profit Margin**: Nvidia has maintained a high profit margin across recent quarters, reflecting its focus on efficient operations and strong pricing strategies.
* **Return on Equity (ROE)**:  The company continues to deliver strong returns on shareholder equity, indicating efficient capital allocation and strong profitability.
* **Debt-to-Equity Ratio**:   Nvidia maintains a relatively low debt-to-equity ratio, demonstrating its sound financial position and ability to manage leverage effectively.


**Conclusion:**

Nvidia's financial performance remains strong, driven by robust demand for its technology across multiple market segments. The company has a clear strategic focus on data centers, AI, automotive, and gaming, positioning it well for future growth. However, the company faces challenges from increased competition, geopolitical tensions, and macroeconomic uncertainties.




**Disclaimer:** This report is based on publicly available financial information and should not be construed as financial advice. Please consult with a qualified professional before making any investment decisions.





>

llama_print_timings:        load time =    1626.24 ms
llama_print_timings:      sample time =    1444.86 ms /  1034 runs   (    1.40 ms per token,   715.64 tokens per second)
llama_print_timings: prompt eval time =   49812.18 ms /    33 tokens ( 1509.46 ms per token,     0.66 tokens per second)
llama_print_timings:        eval time =    8107.13 ms /  1032 runs   (    7.86 ms per token,   127.30 tokens per second)
llama_print_timings:       total time =   71165.62 ms /  1065 tokens
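
For a like-for-like number, ollama's decode rate can be read straight from its API response rather than through the benchmark harness: the non-streaming `/api/generate` reply includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds). A minimal sketch, assuming a local server on the default port and that a `gemma2:2b` tag has been pulled:

```python
import json
import urllib.request

def decode_rate(eval_count: int, eval_duration_ns: int) -> float:
    """Tokens per second from ollama's response fields (duration is in nanoseconds)."""
    return eval_count / (eval_duration_ns / 1e9)

def ollama_generate(model: str, prompt: str,
                    host: str = "http://127.0.0.1:11434") -> dict:
    """Run one non-streaming generation and return the parsed JSON response."""
    req = urllib.request.Request(
        host + "/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

As a sanity check on the formula, plugging in the counts from the llama.cpp timings above, `decode_rate(1032, 8_107_130_000)`, gives about 127.3 t/s, matching the `eval time` line.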
<!-- gh-comment-id:2286322934 --> @phly95 commented on GitHub (Aug 13, 2024): In case it helps, here's the llama.cpp output: ``` \llama-b3542-bin-win-cuda-cu12.2.0-x64> .\llama-cli -m gemma-2-2b-it-Q4_K_M.gguf --threads 16 -ngl 27 --mlock --port 11484 --top_k 40 --repeat_penalty 1.1 --min_p 0.05 --top_p 0.95 --prompt-cache-all -cb -np 4 --batch-size 512 -cnv Log start main: build = 3542 (15fa07a5) main: built with MSVC 19.29.30154.0 for x64 main: seed = 1723556264 llama_model_loader: loaded meta data with 39 key-value pairs and 288 tensors from gemma-2-2b-it-Q4_K_M.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = gemma2 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Gemma 2 2b It llama_model_loader: - kv 3: general.finetune str = it llama_model_loader: - kv 4: general.basename str = gemma-2 llama_model_loader: - kv 5: general.size_label str = 2B llama_model_loader: - kv 6: general.license str = gemma llama_model_loader: - kv 7: general.tags arr[str,2] = ["conversational", "text-generation"] llama_model_loader: - kv 8: gemma2.context_length u32 = 8192 llama_model_loader: - kv 9: gemma2.embedding_length u32 = 2304 llama_model_loader: - kv 10: gemma2.block_count u32 = 26 llama_model_loader: - kv 11: gemma2.feed_forward_length u32 = 9216 llama_model_loader: - kv 12: gemma2.attention.head_count u32 = 8 llama_model_loader: - kv 13: gemma2.attention.head_count_kv u32 = 4 llama_model_loader: - kv 14: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 15: gemma2.attention.key_length u32 = 256 llama_model_loader: - kv 16: gemma2.attention.value_length u32 = 256 llama_model_loader: - kv 17: general.file_type u32 = 15 llama_model_loader: - kv 18: gemma2.attn_logit_softcapping f32 = 50.000000 llama_model_loader: - kv 19: gemma2.final_logit_softcapping f32 = 
30.000000 llama_model_loader: - kv 20: gemma2.attention.sliding_window u32 = 4096 llama_model_loader: - kv 21: tokenizer.ggml.model str = llama llama_model_loader: - kv 22: tokenizer.ggml.pre str = default llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ... llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,256000] = [-1000.000000, -1000.000000, -1000.00... llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ... llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 2 llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 1 llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 3 llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol... llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false llama_model_loader: - kv 34: general.quantization_version u32 = 2 llama_model_loader: - kv 35: quantize.imatrix.file str = /models_out/gemma-2-2b-it-GGUF/gemma-... 
llama_model_loader: - kv 36: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt llama_model_loader: - kv 37: quantize.imatrix.entries_count i32 = 182 llama_model_loader: - kv 38: quantize.imatrix.chunks_count i32 = 128 llama_model_loader: - type f32: 105 tensors llama_model_loader: - type q4_K: 156 tensors llama_model_loader: - type q6_K: 27 tensors llm_load_vocab: special tokens cache size = 249 llm_load_vocab: token to piece cache size = 1.6014 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = gemma2 llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 256000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 2304 llm_load_print_meta: n_layer = 26 llm_load_print_meta: n_head = 8 llm_load_print_meta: n_head_kv = 4 llm_load_print_meta: n_rot = 256 llm_load_print_meta: n_swa = 4096 llm_load_print_meta: n_embd_head_k = 256 llm_load_print_meta: n_embd_head_v = 256 llm_load_print_meta: n_gqa = 2 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 9216 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 2B 
llm_load_print_meta: model ftype = Q4_K - Medium llm_load_print_meta: model params = 2.61 B llm_load_print_meta: model size = 1.59 GiB (5.21 BPW) llm_load_print_meta: general.name = Gemma 2 2b It llm_load_print_meta: BOS token = 2 '<bos>' llm_load_print_meta: EOS token = 1 '<eos>' llm_load_print_meta: UNK token = 3 '<unk>' llm_load_print_meta: PAD token = 0 '<pad>' llm_load_print_meta: LF token = 227 '<0x0A>' llm_load_print_meta: EOT token = 107 '<end_of_turn>' llm_load_print_meta: max token length = 48 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes llm_load_tensors: ggml ctx size = 0.26 MiB llm_load_tensors: offloading 26 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 27/27 layers to GPU llm_load_tensors: CPU buffer size = 461.43 MiB llm_load_tensors: CUDA0 buffer size = 1623.70 MiB .......................................................... 
llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 832.00 MiB llama_new_context_with_model: KV self size = 832.00 MiB, K (f16): 416.00 MiB, V (f16): 416.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 3.91 MiB llama_new_context_with_model: CUDA0 compute buffer size = 504.50 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 36.51 MiB llama_new_context_with_model: graph nodes = 1050 llama_new_context_with_model: graph splits = 2 main: chat template example: <start_of_turn>user You are a helpful assistant Hello<end_of_turn> <start_of_turn>model Hi there<end_of_turn> <start_of_turn>user How are you?<end_of_turn> <start_of_turn>model system_info: n_threads = 16 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | main: interactive mode on. sampling: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000 top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800 mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature generate: n_ctx = 8192, n_batch = 512, n_predict = -1, n_keep = 1 == Running in interactive mode. == - Press Ctrl+C to interject at any time. - Press Return to return control to the AI. - To return control without starting a new line, end your input with '/'. - If you want to submit another line, end your input with '\'. > What is the sky blue? 
The sky appears blue due to a phenomenon called **Rayleigh scattering**. Here's a breakdown:

1. **Sunlight and its Colors:** Sunlight contains all colors of the rainbow, each with its own wavelength (like visible light).
2. **Earth's Atmosphere:** Our atmosphere is composed mostly of nitrogen and oxygen molecules.
3. **Scattering:** When sunlight enters the atmosphere, it interacts with these tiny molecules. The shorter wavelengths of light (blue and violet) are scattered more strongly than longer wavelengths like red and orange.
4. **Human Perception:** Our eyes are most sensitive to blue light, meaning we perceive this scattered light as the dominant color of the sky.

**Why not other colors?**

* **Violet:** While violet light is scattered even more intensely than blue, our eyes are less sensitive to it, so we don't see it as prominently in the daytime sky.
* **Red and Orange:** These longer wavelengths are scattered less, which is why we see them as dominant during sunrise and sunset.

**In summary:** The blue sky is a result of sunlight being scattered by our atmosphere's molecules, making blue light dominate the color we perceive.

Let me know if you have any further questions!

> Write a report on the financials of Nvidia

## Nvidia Financial Snapshot: A Deep Dive

This report provides an overview of Nvidia's financial performance, analyzing key financial metrics and identifying key trends.

**Q1 & Q2 2023 Performance:**

* **Revenue**: Strong revenue growth continued in both Q1 and Q2 2023, driven by robust demand for data centers and AI solutions.
    * Q1 2023: $7.68 billion (up 14% year-over-year)
    * Q2 2023: $8.85 billion (up 29% year-over-year)
* **Net Income**: Nvidia's net income saw a significant increase in both quarters, reflecting the company's strong performance and efficient cost management.
    * Q1 2023: $1.94 billion (up 68% year-over-year)
    * Q2 2023: $2.17 billion (up 64% year-over-year)
* **Earnings per Share**: EPS also saw significant growth, reflecting the company's profitability and strong financial position.
    * Q1 2023: $0.85 per share
    * Q2 2023: $1.16 per share

**Drivers of Financial Success:**

* **Data Center Market:** Nvidia's data center business has been a key driver of revenue growth, fueled by demand for its GPUs (Graphics Processing Units) used in AI training and cloud computing.
* **Gaming Segment**: While facing headwinds from increased competition, the gaming segment remains a significant contributor to Nvidia's revenue, benefiting from strong demand for high-performance graphics cards.
* **Automotive Sector:** The company's automotive segment has been experiencing rapid growth, driven by its technology enabling autonomous driving features and connected vehicles.

**Challenges & Risks:**

* **Geopolitical Tensions**: The ongoing geopolitical tensions create uncertainty in the global economy, potentially impacting demand for Nvidia's products in various sectors.
* **Competition**: Competition within the GPU market is intensifying as rival companies like AMD and Intel aggressively enter this space.
* **Macroeconomic Factors**: Economic slowdown and rising inflation pose challenges to overall demand across industries, including Nvidia's key markets.

**Future Outlook:**

* **Continued Growth in Data Centers & AI:** Nvidia expects sustained growth in data center and AI segments as companies invest heavily in cloud computing and artificial intelligence development.
* **Expansion into Automotive and Other Emerging Sectors:** Nvidia is actively pursuing expansion opportunities in automotive, gaming, and other emerging markets to diversify its revenue streams.

**Key Financial Ratios:**

* **Profit Margin**: Nvidia has maintained a high profit margin across recent quarters, reflecting its focus on efficient operations and strong pricing strategies.
* **Return on Equity (ROE)**: The company continues to deliver strong returns on shareholder equity, indicating efficient capital allocation and strong profitability.
* **Debt-to-Equity Ratio**: Nvidia maintains a relatively low debt-to-equity ratio, demonstrating its sound financial position and ability to manage leverage effectively.

**Conclusion:**

Nvidia's financial performance remains strong, driven by robust demand for its technology across multiple market segments. The company has a clear strategic focus on data centers, AI, automotive, and gaming, positioning it well for future growth. However, the company faces challenges from increased competition, geopolitical tensions, and macroeconomic uncertainties.

**Disclaimer:** This report is based on publicly available financial information and should not be construed as financial advice. Please consult with a qualified professional before making any investment decisions.

>
llama_print_timings: load time = 1626.24 ms
llama_print_timings: sample time = 1444.86 ms / 1034 runs ( 1.40 ms per token, 715.64 tokens per second)
llama_print_timings: prompt eval time = 49812.18 ms / 33 tokens ( 1509.46 ms per token, 0.66 tokens per second)
llama_print_timings: eval time = 8107.13 ms / 1032 runs ( 7.86 ms per token, 127.30 tokens per second)
llama_print_timings: total time = 71165.62 ms / 1065 tokens
```
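The `llama_print_timings` output above gives llama.cpp's decode rate directly (1032 runs in 8107.13 ms → 127.30 t/s). For an apples-to-apples number from ollama, the same figure can be derived from the `eval_count` and `eval_duration` fields that ollama's `/api/generate` endpoint returns (durations are in nanoseconds). A minimal sketch — the model tag and host are assumptions matching this issue, and this is not the llm-benchmark tool itself:

```python
import json
import urllib.request

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    # ollama reports eval_duration in nanoseconds, so scale to seconds.
    return eval_count / eval_duration_ns * 1e9

def benchmark_once(prompt: str, model: str = "gemma2:2b",
                   host: str = "http://localhost:11434") -> float:
    # One non-streaming generation; the final response object carries
    # eval_count (tokens generated) and eval_duration (decode time, ns).
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return tokens_per_second(data["eval_count"], data["eval_duration"])

# Example (requires a running ollama server):
#   print(f"{benchmark_once('Why is the sky blue?'):.1f} t/s")
```

Plugging in the llama.cpp eval numbers above (1032 tokens, 8.10713 s) reproduces the reported 127.30 t/s, so both tools' figures are directly comparable.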

@phly95 commented on GitHub (Aug 13, 2024):

One thing I noticed (with the help of an llm) is that llama.cpp shows fma = 1 while ollama shows it as 0.

<!-- gh-comment-id:2286339858 --> @phly95 commented on GitHub (Aug 13, 2024): One thing I noticed (with the help of an llm) is that llama.cpp shows fma = 1 while ollama shows it as 0.

@phly95 commented on GitHub (Aug 13, 2024):

I also do not see a CUDA 12 runner in AppData\Local\Programs\Ollama\ollama_runners , which may also contribute to the slowdown.

<!-- gh-comment-id:2286345843 --> @phly95 commented on GitHub (Aug 13, 2024): I also do not see a CUDA 12 runner in AppData\Local\Programs\Ollama\ollama_runners , which may also contribute to the slowdown.

@phly95 commented on GitHub (Aug 13, 2024):

Per https://github.com/ollama/ollama/issues/4958, it seems a CUDA 12 backend has been added in a fork; it's just not merged into upstream yet.

<!-- gh-comment-id:2286361019 --> @phly95 commented on GitHub (Aug 13, 2024): https://github.com/ollama/ollama/issues/4958 seems like a CUDA 12 backend has been added in a fork, it's just not merged into upstream yet.

@rick-github commented on GitHub (Aug 13, 2024):

It's quite possible that the difference in build environment can be an effect. Note however that you are not comparing the same model: llama.cpp is using gemma-2-2b-it-Q4_K_M.gguf and ollama is using gemma2:2b-instruct-q4_0. Notably, the tensor mix and model size are different.

gemma2:2b-instruct-q4_0

```
llama_model_loader: - type q4_0:  182 tensors
llama_model_loader: - type q6_K:    1 tensors
model size       = 1.51 GiB (4.97 BPW)
```

gemma-2-2b-it-Q4_K_M.gguf

```
llama_model_loader: - type q4_K:  156 tensors
llama_model_loader: - type q6_K:   27 tensors
model size       = 1.59 GiB (5.21 BPW)
```

If you want to exclude this as a cause, you can try running llama.cpp with the ollama model (not that I expect it to make a significant difference, but apples to apples):

.\llama-cli -m C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b --threads 16 -ngl 27 --mlock --port 11484 --top_k 40 --repeat_penalty 1.1 --min_p 0.05 --top_p 0.95 --prompt-cache-all -cb -np 4 --batch-size 512 -cnv
<!-- gh-comment-id:2286630771 --> @rick-github commented on GitHub (Aug 13, 2024): It's quite possible that the difference in build environment can be an effect. Note however that you are not comparing the same model: llama.cpp is using gemma-2-2b-it-Q4_K_M.gguf and ollama is using gemma2:2b-instruct-q4_0. Notably, the tensor mix and model size are different. gemma2:2b-instruct-q4_0 ``` llama_model_loader: - type q4_0: 182 tensors llama_model_loader: - type q6_K: 1 tensors model size = 1.51 GiB (4.97 BPW) ``` gemma-2-2b-it-Q4_K_M.gguf ``` llama_model_loader: - type q4_K: 156 tensors llama_model_loader: - type q6_K: 27 tensors model size = 1.59 GiB (5.21 BPW) ``` If you want to exclude this as a cause, you can try running llama.cpp with the ollama model (not that I expect it to make a significant difference, but apples to apples): ``` .\llama-cli -m C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b --threads 16 -ngl 27 --mlock --port 11484 --top_k 40 --repeat_penalty 1.1 --min_p 0.05 --top_p 0.95 --prompt-cache-all -cb -np 4 --batch-size 512 -cnv ```
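The BPW (bits per weight) figures in the comment above follow directly from file size and parameter count, which is a quick way to sanity-check that two quantizations really differ. A small sketch — the ~2.61B parameter count for gemma-2-2b is an approximation inferred from the sizes and BPW values quoted above, not taken from the model card:

```python
GIB = 1024 ** 3
N_PARAMS = 2.61e9  # approximate gemma-2-2b weight count (assumption)

def bits_per_weight(model_size_gib: float, n_params: float = N_PARAMS) -> float:
    # BPW = total bits in the file / number of weights.
    return model_size_gib * GIB * 8 / n_params

print(f"q4_0:   {bits_per_weight(1.51):.2f} BPW")  # close to the reported 4.97
print(f"Q4_K_M: {bits_per_weight(1.59):.2f} BPW")  # close to the reported 5.21
```

The heavier Q4_K_M mix (more q6_K tensors) costs roughly 0.25 extra bits per weight, which is one plausible source of a speed difference between the two runs.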

@dhiltgen commented on GitHub (Aug 13, 2024):

@phly95 I tried a custom build with cuda v12 and adjusted the cmake flags to match what you have in your llama.cpp system info but I'm not seeing a significant performance difference. Could you share more details about how you build llama.cpp?

<!-- gh-comment-id:2287402578 --> @dhiltgen commented on GitHub (Aug 13, 2024): @phly95 I tried a custom build with cuda v12 and adjusted the cmake flags to match what you have in your llama.cpp `system info` but I'm not seeing a significant performance difference. Could you share more details about how you build llama.cpp?

@mxmp210 commented on GitHub (Aug 23, 2024):

There's one more difference between the two: the ollama run is detecting 8/16 threads while llama.cpp shows 16/16. Can you confirm whether your CPU has a full 16 cores with no SMT? llama.cpp has merged code to address this on Windows, but upstream is still waiting on the update.

<!-- gh-comment-id:2306867391 --> @mxmp210 commented on GitHub (Aug 23, 2024): There's one more difference between the two where ollama version is detecting 8/16 threads and llama.cpp is showing 16/16 threads. Can you confirm your CPU has full 16 cores and no SMT? - llama.cpp has merged code to address this under windows but upstream is still waiting on the update.
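The 8/16 vs 16/16 gap above is consistent with defaulting to one worker thread per physical core on an SMT CPU. A toy illustration of that heuristic — this is not ollama's actual thread-selection code, just the arithmetic behind the observed numbers:

```python
def default_threads(logical_cpus: int, smt: bool) -> int:
    # Illustrative heuristic: with SMT (two hardware threads per core),
    # use one worker per physical core, i.e. half the logical CPUs;
    # without SMT, use them all.
    return logical_cpus // 2 if smt else logical_cpus

# A 16-logical-CPU SMT part gets 8 workers by default, whereas
# llama-cli's explicit `--threads 16` pins it to all 16.
```

Which default is faster depends on the workload: token generation is usually memory-bandwidth-bound, so extra SMT threads can even hurt, which is why confirming the actual core topology matters here.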

@dhiltgen commented on GitHub (Oct 22, 2024):

We've improved the default thread algorithm recently, which may help. We also weren't compiling explicitly for CC 8.6 for windows which could also be contributing to performance issues.

@phly95 can you try out the latest 0.4.0 RC build and see if that closes the performance gap on your system?

https://github.com/ollama/ollama/releases

<!-- gh-comment-id:2430222677 --> @dhiltgen commented on GitHub (Oct 22, 2024): We've improved the default thread algorithm recently, which may help. We also weren't compiling explicitly for CC 8.6 for windows which could also be contributing to performance issues. @phly95 can you try out the latest 0.4.0 RC build and see if that closes the performance gap on your system? https://github.com/ollama/ollama/releases
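For readers who want to reproduce the comparison with a matching native build, compiling llama.cpp explicitly for CC 8.6 (RTX 30-series) looks roughly like the following. This is a sketch: `GGML_CUDA` reflects llama.cpp's CMake option naming around mid-2024 (older trees used `LLAMA_CUDA`/`LLAMA_CUBLAS`), while `CMAKE_CUDA_ARCHITECTURES` is standard CMake:

```shell
# From a llama.cpp checkout: CUDA build pinned to compute capability 8.6.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release -j 16
```

Pinning the architecture avoids JIT-compiling PTX at load time and lets the compiler emit SM 8.6-tuned kernels, which is the effect the comment above attributes to the 0.4.0 change.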

@pdevine commented on GitHub (Jan 15, 2025):

I'm going to go ahead and close this since there hasn't been any update for a while.

<!-- gh-comment-id:2593969752 --> @pdevine commented on GitHub (Jan 15, 2025): I'm going to go ahead and close this since there hasn't been any update for a while.
Reference: github-starred/ollama#29738