[GH-ISSUE #3736] v0.1.32 is running GPU capable models on CPU #48813

Closed
opened 2026-04-28 09:24:52 -05:00 by GiteaMirror · 38 comments

Originally created by @MarkWard0110 on GitHub (Apr 18, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3736

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I sometimes find that Ollama runs a model on the CPU when it should be on the GPU. I just upgraded to v0.1.32. I am still trying to figure out how to reproduce the issue, and I don't know whether it is related to an error I get when loading one of the new models.
Hardware:
Intel Core i9 14900k
DDR5 6400MHz 2x48GB
Nvidia RTX 4070 TI Super 16GB

I have yet to make it through a benchmark run without this happening.
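For reference, here is how I check where the model actually ended up. This assumes the default install, where Ollama runs as the `ollama` systemd service and `nvidia-smi` is on the PATH:

```sh
# Show the scheduler's offload decision and which runner it started
# (assumes Ollama is running as the "ollama" systemd service)
journalctl -u ollama --no-pager | grep -E 'offload to gpu|starting llama server' | tail -n 4

# Confirm whether the weights actually landed in VRAM
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
```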

Here are the logs from around the time it loaded the model into CPU RAM:

```
Apr 18 19:28:59 quorra ollama[1170]: time=2024-04-18T19:28:59.315Z level=INFO source=routes.go:97 msg="changing loaded model"
Apr 18 19:28:59 quorra ollama[1170]: time=2024-04-18T19:28:59.380Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
Apr 18 19:28:59 quorra ollama[1170]: time=2024-04-18T19:28:59.380Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
Apr 18 19:28:59 quorra ollama[1170]: time=2024-04-18T19:28:59.381Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1615772994/runners/cuda_v11/libcudart.so.11.0]"
Apr 18 19:28:59 quorra ollama[1170]: time=2024-04-18T19:28:59.381Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
Apr 18 19:28:59 quorra ollama[1170]: time=2024-04-18T19:28:59.381Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 18 19:28:59 quorra ollama[1170]: time=2024-04-18T19:28:59.408Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
Apr 18 19:28:59 quorra ollama[1170]: time=2024-04-18T19:28:59.419Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
Apr 18 19:28:59 quorra ollama[1170]: time=2024-04-18T19:28:59.419Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
Apr 18 19:28:59 quorra ollama[1170]: time=2024-04-18T19:28:59.419Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1615772994/runners/cuda_v11/libcudart.so.11.0]"
Apr 18 19:28:59 quorra ollama[1170]: time=2024-04-18T19:28:59.420Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
Apr 18 19:28:59 quorra ollama[1170]: time=2024-04-18T19:28:59.420Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 18 19:28:59 quorra ollama[1170]: time=2024-04-18T19:28:59.439Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
Apr 18 19:28:59 quorra ollama[1170]: time=2024-04-18T19:28:59.450Z level=INFO source=server.go:127 msg="offload to gpu" reallayers=0 layers=0 required="10961.0 MiB" used="901.1 MiB" available="270.6 MiB" kv="3200.0 MiB" fulloffload="368.0 MiB" partialoffload="444.1 MiB"
Apr 18 19:28:59 quorra ollama[1170]: time=2024-04-18T19:28:59.450Z level=INFO source=server.go:264 msg="starting llama server" cmd="/tmp/ollama1615772994/runners/cpu/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-b17551ffad6537e746d58ca02744788b230e7e30d4796976917e6c589518c830 --ctx-size 4096 --batch-size 512 --embedding --log-disable --n-gpu-layers 0 --port 39607"
Apr 18 19:28:59 quorra ollama[1170]: time=2024-04-18T19:28:59.450Z level=INFO source=server.go:389 msg="waiting for llama runner to start responding"
Apr 18 19:28:59 quorra ollama[567056]: {"function":"server_params_parse","level":"INFO","line":2603,"msg":"logging to file is disabled.","tid":"140525997987712","timestamp":1713468539}
Apr 18 19:28:59 quorra ollama[567056]: {"function":"server_params_parse","level":"WARN","line":2380,"msg":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1,"tid":"140525997987712","timestamp":1713468539}
Apr 18 19:28:59 quorra ollama[567056]: {"build":1,"commit":"7593639","function":"main","level":"INFO","line":2819,"msg":"build info","tid":"140525997987712","timestamp":1713468539}
Apr 18 19:28:59 quorra ollama[567056]: {"function":"main","level":"INFO","line":2822,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"140525997987712","timestamp":1713468539,"total_threads":32}
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: loaded meta data with 20 key-value pairs and 363 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-b17551ffad6537e746d58ca02744788b230e7e30d4796976917e6c589518c830 (version GGUF V2)
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - kv   4:                          llama.block_count u32              = 40
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 40
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 40
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - kv  11:                          general.file_type u32              = 2
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32003]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32003]   = [0.000000, 0.000000, 0.000000, 0.0000...
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32003]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - kv  19:               general.quantization_version u32              = 2
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - type  f32:   81 tensors
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - type q4_0:  281 tensors
Apr 18 19:28:59 quorra ollama[1170]: llama_model_loader: - type q6_K:    1 tensors
Apr 18 19:28:59 quorra ollama[1170]: llm_load_vocab: special tokens definition check successful ( 262/32003 ).
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: format           = GGUF V2
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: arch             = llama
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: vocab type       = SPM
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: n_vocab          = 32003
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: n_merges         = 0
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: n_ctx_train      = 4096
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: n_embd           = 5120
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: n_head           = 40
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: n_head_kv        = 40
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: n_layer          = 40
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: n_rot            = 128
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: n_embd_head_k    = 128
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: n_embd_head_v    = 128
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: n_gqa            = 1
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: n_embd_k_gqa     = 5120
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: n_embd_v_gqa     = 5120
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: n_ff             = 13824
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: n_expert         = 0
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: n_expert_used    = 0
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: causal attn      = 1
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: pooling type     = 0
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: rope type        = 0
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: rope scaling     = linear
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: freq_base_train  = 10000.0
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: freq_scale_train = 1
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: n_yarn_orig_ctx  = 4096
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: rope_finetuned   = unknown
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: ssm_d_conv       = 0
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: ssm_d_inner      = 0
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: ssm_d_state      = 0
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: ssm_dt_rank      = 0
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: model type       = 13B
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: model ftype      = Q4_0
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: model params     = 13.02 B
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: model size       = 6.86 GiB (4.53 BPW)
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: general.name     = LLaMA v2
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: BOS token        = 1 '<s>'
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: EOS token        = 2 '</s>'
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: UNK token        = 0 '<unk>'
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: PAD token        = 0 '<unk>'
Apr 18 19:28:59 quorra ollama[1170]: llm_load_print_meta: LF token         = 13 '<0x0A>'
Apr 18 19:28:59 quorra ollama[1170]: llm_load_tensors: ggml ctx size =    0.14 MiB
Apr 18 19:29:01 quorra ollama[1170]: llm_load_tensors:        CPU buffer size =  7023.92 MiB
Apr 18 19:29:01 quorra ollama[1170]: ...................................................................................................
Apr 18 19:29:01 quorra ollama[1170]: llama_new_context_with_model: n_ctx      = 4096
Apr 18 19:29:01 quorra ollama[1170]: llama_new_context_with_model: n_batch    = 512
Apr 18 19:29:01 quorra ollama[1170]: llama_new_context_with_model: n_ubatch   = 512
Apr 18 19:29:01 quorra ollama[1170]: llama_new_context_with_model: freq_base  = 10000.0
Apr 18 19:29:01 quorra ollama[1170]: llama_new_context_with_model: freq_scale = 1
Apr 18 19:29:02 quorra ollama[1170]: llama_kv_cache_init:        CPU KV buffer size =  3200.00 MiB
Apr 18 19:29:02 quorra ollama[1170]: llama_new_context_with_model: KV self size  = 3200.00 MiB, K (f16): 1600.00 MiB, V (f16): 1600.00 MiB
Apr 18 19:29:02 quorra ollama[1170]: llama_new_context_with_model:        CPU  output buffer size =     0.14 MiB
Apr 18 19:29:02 quorra ollama[1170]: llama_new_context_with_model:        CPU compute buffer size =   368.01 MiB
Apr 18 19:29:02 quorra ollama[1170]: llama_new_context_with_model: graph nodes  = 1286
Apr 18 19:29:02 quorra ollama[1170]: llama_new_context_with_model: graph splits = 1
Apr 18 19:29:03 quorra ollama[567056]: {"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"140525997987712","timestamp":1713468543}
Apr 18 19:29:03 quorra ollama[567056]: {"function":"initialize","level":"INFO","line":457,"msg":"new slot","n_ctx_slot":4096,"slot_id":0,"tid":"140525997987712","timestamp":1713468543}
Apr 18 19:29:03 quorra ollama[567056]: {"function":"main","level":"INFO","line":3064,"msg":"model loaded","tid":"140525997987712","timestamp":1713468543}
Apr 18 19:29:03 quorra ollama[567056]: {"function":"main","hostname":"127.0.0.1","level":"INFO","line":3267,"msg":"HTTP server listening","n_threads_http":"31","port":"39607","tid":"140525997987712","timestamp":1713468543}
```
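In this failing run the scheduler reports `available="270.6 MiB"` against `required="10961.0 MiB"`, so it picks the CPU runner with `--n-gpu-layers 0`. As a cross-check (my own debugging step, not something Ollama does), I log what the driver reports while the load happens and compare it against that `available` figure:

```sh
# Print free VRAM once per second while the model is loading
nvidia-smi --query-gpu=timestamp,memory.used,memory.free --format=csv -l 1
```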

The same model when loaded another time (this time it went to the GPU):

```
Apr 18 19:49:56 quorra ollama[1170]: time=2024-04-18T19:49:56.159Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
Apr 18 19:49:56 quorra ollama[1170]: time=2024-04-18T19:49:56.159Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
Apr 18 19:49:56 quorra ollama[1170]: time=2024-04-18T19:49:56.161Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1615772994/runners/cuda_v11/libcudart.so.11.0]"
Apr 18 19:49:56 quorra ollama[1170]: time=2024-04-18T19:49:56.161Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
Apr 18 19:49:56 quorra ollama[1170]: time=2024-04-18T19:49:56.161Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 18 19:49:56 quorra ollama[1170]: time=2024-04-18T19:49:56.199Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
Apr 18 19:49:56 quorra ollama[1170]: time=2024-04-18T19:49:56.211Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
Apr 18 19:49:56 quorra ollama[1170]: time=2024-04-18T19:49:56.211Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
Apr 18 19:49:56 quorra ollama[1170]: time=2024-04-18T19:49:56.212Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1615772994/runners/cuda_v11/libcudart.so.11.0]"
Apr 18 19:49:56 quorra ollama[1170]: time=2024-04-18T19:49:56.212Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
Apr 18 19:49:56 quorra ollama[1170]: time=2024-04-18T19:49:56.212Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 18 19:49:56 quorra ollama[1170]: time=2024-04-18T19:49:56.236Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
Apr 18 19:49:56 quorra ollama[1170]: time=2024-04-18T19:49:56.249Z level=INFO source=server.go:127 msg="offload to gpu" reallayers=41 layers=41 required="10961.0 MiB" used="10961.0 MiB" available="15857.2 MiB" kv="3200.0 MiB" fulloffload="368.0 MiB" partialoffload="444.1 MiB"
Apr 18 19:49:56 quorra ollama[1170]: time=2024-04-18T19:49:56.249Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 18 19:49:56 quorra ollama[1170]: time=2024-04-18T19:49:56.249Z level=INFO source=server.go:264 msg="starting llama server" cmd="/tmp/ollama1615772994/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-b17551ffad6537e746d58ca02744788b230e7e30d4796976917e6c589518c830 --ctx-size 4096 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --port 37295"
Apr 18 19:49:56 quorra ollama[1170]: time=2024-04-18T19:49:56.249Z level=INFO source=server.go:389 msg="waiting for llama runner to start responding"
Apr 18 19:49:56 quorra ollama[583304]: {"function":"server_params_parse","level":"INFO","line":2603,"msg":"logging to file is disabled.","tid":"140087397396480","timestamp":1713469796}
Apr 18 19:49:56 quorra ollama[583304]: {"build":1,"commit":"7593639","function":"main","level":"INFO","line":2819,"msg":"build info","tid":"140087397396480","timestamp":1713469796}
Apr 18 19:49:56 quorra ollama[583304]: {"function":"main","level":"INFO","line":2822,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"140087397396480","timestamp":1713469796,"total_threads":32}
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: loaded meta data with 20 key-value pairs and 363 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-b17551ffad6537e746d58ca02744788b230e7e30d4796976917e6c589518c830 (version GGUF V2)
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - kv   4:                          llama.block_count u32              = 40
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 40
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 40
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - kv  11:                          general.file_type u32              = 2
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32003]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32003]   = [0.000000, 0.000000, 0.000000, 0.0000...
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32003]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - kv  19:               general.quantization_version u32              = 2
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - type  f32:   81 tensors
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - type q4_0:  281 tensors
Apr 18 19:49:56 quorra ollama[1170]: llama_model_loader: - type q6_K:    1 tensors
Apr 18 19:49:56 quorra ollama[1170]: llm_load_vocab: special tokens definition check successful ( 262/32003 ).
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: format           = GGUF V2
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: arch             = llama
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: vocab type       = SPM
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: n_vocab          = 32003
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: n_merges         = 0
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: n_ctx_train      = 4096
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: n_embd           = 5120
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: n_head           = 40
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: n_head_kv        = 40
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: n_layer          = 40
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: n_rot            = 128
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: n_embd_head_k    = 128
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: n_embd_head_v    = 128
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: n_gqa            = 1
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: n_embd_k_gqa     = 5120
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: n_embd_v_gqa     = 5120
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: n_ff             = 13824
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: n_expert         = 0
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: n_expert_used    = 0
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: causal attn      = 1
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: pooling type     = 0
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: rope type        = 0
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: rope scaling     = linear
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: freq_base_train  = 10000.0
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: freq_scale_train = 1
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: n_yarn_orig_ctx  = 4096
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: rope_finetuned   = unknown
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: ssm_d_conv       = 0
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: ssm_d_inner      = 0
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: ssm_d_state      = 0
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: ssm_dt_rank      = 0
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: model type       = 13B
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: model ftype      = Q4_0
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: model params     = 13.02 B
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: model size       = 6.86 GiB (4.53 BPW)
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: general.name     = LLaMA v2
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: BOS token        = 1 '<s>'
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: EOS token        = 2 '</s>'
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: UNK token        = 0 '<unk>'
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: PAD token        = 0 '<unk>'
Apr 18 19:49:56 quorra ollama[1170]: llm_load_print_meta: LF token         = 13 '<0x0A>'
Apr 18 19:49:56 quorra ollama[1170]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
Apr 18 19:49:56 quorra ollama[1170]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
Apr 18 19:49:56 quorra ollama[1170]: ggml_cuda_init: found 1 CUDA devices:
Apr 18 19:49:56 quorra ollama[1170]:   Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
Apr 18 19:49:56 quorra ollama[1170]: llm_load_tensors: ggml ctx size =    0.28 MiB
Apr 18 19:49:56 quorra ollama[1170]: llm_load_tensors: offloading 40 repeating layers to GPU
Apr 18 19:49:56 quorra ollama[1170]: llm_load_tensors: offloading non-repeating layers to GPU
Apr 18 19:49:56 quorra ollama[1170]: llm_load_tensors: offloaded 41/41 layers to GPU
Apr 18 19:49:56 quorra ollama[1170]: llm_load_tensors:        CPU buffer size =    87.90 MiB
Apr 18 19:49:56 quorra ollama[1170]: llm_load_tensors:      CUDA0 buffer size =  6936.02 MiB
Apr 18 19:49:56 quorra ollama[1170]: ...................................................................................................
Apr 18 19:49:56 quorra ollama[1170]: llama_new_context_with_model: n_ctx      = 4096
Apr 18 19:49:56 quorra ollama[1170]: llama_new_context_with_model: n_batch    = 512
Apr 18 19:49:56 quorra ollama[1170]: llama_new_context_with_model: n_ubatch   = 512
Apr 18 19:49:56 quorra ollama[1170]: llama_new_context_with_model: freq_base  = 10000.0
Apr 18 19:49:56 quorra ollama[1170]: llama_new_context_with_model: freq_scale = 1
Apr 18 19:49:56 quorra ollama[1170]: llama_kv_cache_init:      CUDA0 KV buffer size =  3200.00 MiB
Apr 18 19:49:56 quorra ollama[1170]: llama_new_context_with_model: KV self size  = 3200.00 MiB, K (f16): 1600.00 MiB, V (f16): 1600.00 MiB
Apr 18 19:49:57 quorra ollama[1170]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.14 MiB
Apr 18 19:49:57 quorra ollama[1170]: llama_new_context_with_model:      CUDA0 compute buffer size =   368.00 MiB
Apr 18 19:49:57 quorra ollama[1170]: llama_new_context_with_model:  CUDA_Host compute buffer size =    18.01 MiB
Apr 18 19:49:57 quorra ollama[1170]: llama_new_context_with_model: graph nodes  = 1286
Apr 18 19:49:57 quorra ollama[1170]: llama_new_context_with_model: graph splits = 2
Apr 18 19:49:57 quorra ollama[583304]: {"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"140087397396480","timestamp":1713469797}
Apr 18 19:49:57 quorra ollama[583304]: {"function":"initialize","level":"INFO","line":457,"msg":"new slot","n_ctx_slot":4096,"slot_id":0,"tid":"140087397396480","timestamp":1713469797}
Apr 18 19:49:57 quorra ollama[583304]: {"function":"main","level":"INFO","line":3064,"msg":"model loaded","tid":"140087397396480","timestamp":1713469797}
Apr 18 19:49:57 quorra ollama[583304]: {"function":"main","hostname":"127.0.0.1","level":"INFO","line":3267,"msg":"HTTP server listening","n_threads_http":"31","port":"37295","tid":"140087397396480","timestamp":1713469797}
```

I wonder if this is related. Here is the error I get when I attempt to load DBRX:

```
Apr 18 18:57:54 quorra ollama[1170]: time=2024-04-18T18:57:54.713Z level=INFO source=routes.go:97 msg="changing loaded model"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.028Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.028Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.028Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1615772994/runners/cuda_v11/libcudart.so.11.0]"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.039Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.039Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.128Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.141Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.141Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.142Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1615772994/runners/cuda_v11/libcudart.so.11.0]"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.142Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.142Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.174Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.186Z level=INFO source=server.go:127 msg="offload to gpu" reallayers=8 layers=8 required="71518.7 MiB" used="14828.9 MiB" available="15857.2 MiB" kv="320.0 MiB" fulloffload="320.0 MiB" partialoffload="320.0 MiB"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.186Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.187Z level=INFO source=server.go:264 msg="starting llama server" cmd="/tmp/ollama1615772994/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-1d12441f19436dbb0bcc4067e9d47921b944ef4a87b35873aa430e85e91a93c8 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 8 --port 41305"
Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.187Z level=INFO source=server.go:389 msg="waiting for llama runner to start responding"
Apr 18 18:57:55 quorra ollama[20191]: {"function":"server_params_parse","level":"INFO","line":2603,"msg":"logging to file is disabled.","tid":"140044021682176","timestamp":1713466675}
Apr 18 18:57:55 quorra ollama[20191]: {"build":1,"commit":"7593639","function":"main","level":"INFO","line":2819,"msg":"build info","tid":"140044021682176","timestamp":1713466675}
Apr 18 18:57:55 quorra ollama[20191]: {"function":"main","level":"INFO","line":2822,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"140044021682176","timestamp":1713466675,"total_threads":32}
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: loaded meta data with 24 key-value pairs and 323 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-1d12441f19436dbb0bcc4067e9d47921b944ef4a87b35873aa430e85e91a93c8 (version GGUF V3 (latest))
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   0:                       general.architecture str              = dbrx
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   1:                               general.name str              = dbrx
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   2:                           dbrx.block_count u32              = 40
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   3:                        dbrx.context_length u32              = 32768
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   4:                      dbrx.embedding_length u32              = 6144
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   5:                   dbrx.feed_forward_length u32              = 10752
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   6:                  dbrx.attention.head_count u32              = 48
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   7:               dbrx.attention.head_count_kv u32              = 8
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   8:                        dbrx.rope.freq_base f32              = 500000.000000
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv   9:                   dbrx.attention.clamp_kqv f32              = 8.000000
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  10:                          general.file_type u32              = 2
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  11:                          dbrx.expert_count u32              = 16
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  12:                     dbrx.expert_used_count u32              = 4
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  13:          dbrx.attention.layer_norm_epsilon f32              = 0.000010
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,100352]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,100352]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,100000]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 100257
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 100257
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 100257
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 100277
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv  23:               general.quantization_version u32              = 2
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - type  f32:   81 tensors
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - type  f16:   40 tensors
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - type q4_0:  201 tensors
Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - type q6_K:    1 tensors
Apr 18 18:57:55 quorra ollama[1170]: llm_load_vocab: special tokens definition check successful ( 96/100352 ).
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: format           = GGUF V3 (latest)
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: arch             = dbrx
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: vocab type       = BPE
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_vocab          = 100352
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_merges         = 100000
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_ctx_train      = 32768
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_embd           = 6144
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_head           = 48
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_head_kv        = 8
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_layer          = 40
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_rot            = 128
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_embd_head_k    = 128
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_embd_head_v    = 128
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_gqa            = 6
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_embd_k_gqa     = 1024
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_embd_v_gqa     = 1024
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: f_norm_eps       = 1.0e-05
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: f_clamp_kqv      = 8.0e+00
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_ff             = 10752
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_expert         = 16
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_expert_used    = 4
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: causal attn      = 1
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: pooling type     = 0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: rope type        = 2
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: rope scaling     = linear
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: freq_base_train  = 500000.0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: freq_scale_train = 1
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_yarn_orig_ctx  = 32768
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: rope_finetuned   = unknown
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: ssm_d_conv       = 0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: ssm_d_inner      = 0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: ssm_d_state      = 0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: ssm_dt_rank      = 0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: model type       = 16x12B
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: model ftype      = Q4_0
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: model params     = 131.60 B
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: model size       = 69.09 GiB (4.51 BPW)
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: general.name     = dbrx
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: BOS token        = 100257 '<|endoftext|>'
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: EOS token        = 100257 '<|endoftext|>'
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: UNK token        = 100257 '<|endoftext|>'
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: PAD token        = 100277 '<|pad|>'
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: LF token         = 128 'Ä'
Apr 18 18:57:55 quorra ollama[1170]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
Apr 18 18:57:55 quorra ollama[1170]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
Apr 18 18:57:55 quorra ollama[1170]: ggml_cuda_init: found 1 CUDA devices:
Apr 18 18:57:55 quorra ollama[1170]:   Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
Apr 18 18:57:55 quorra ollama[1170]: llm_load_tensors: ggml ctx size =    0.74 MiB
Apr 18 18:58:18 quorra ollama[1170]: llm_load_tensors: offloading 8 repeating layers to GPU
Apr 18 18:58:18 quorra ollama[1170]: llm_load_tensors: offloaded 8/41 layers to GPU
Apr 18 18:58:18 quorra ollama[1170]: llm_load_tensors:        CPU buffer size = 70752.49 MiB
Apr 18 18:58:18 quorra ollama[1170]: llm_load_tensors:      CUDA0 buffer size = 13987.88 MiB
Apr 18 18:58:19 quorra ollama[1170]: ....................................................................................................
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: n_ctx      = 2048
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: n_batch    = 512
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: n_ubatch   = 512
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: freq_base  = 500000.0
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: freq_scale = 1
Apr 18 18:58:19 quorra ollama[1170]: llama_kv_cache_init:  CUDA_Host KV buffer size =   256.00 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_kv_cache_init:      CUDA0 KV buffer size =    64.00 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: KV self size  =  320.00 MiB, K (f16):  160.00 MiB, V (f16):  160.00 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.41 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model:      CUDA0 compute buffer size =  1794.00 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model:  CUDA_Host compute buffer size =    16.01 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: graph nodes  = 2886
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: graph splits = 325
Apr 18 18:58:20 quorra ollama[1170]: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED
Apr 18 18:58:20 quorra ollama[1170]:   current device: 0, in function cublas_handle at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda/common.cuh:526
Apr 18 18:58:20 quorra ollama[1170]:   cublasCreate_v2(&cublas_handles[device])
Apr 18 18:58:20 quorra ollama[1170]: GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:60: !"CUDA error"
Apr 18 18:58:20 quorra ollama[1170]: time=2024-04-18T18:58:20.740Z level=ERROR source=routes.go:120 msg="error loading llama server" error="llama runner process no longer running: -1 CUDA error: CUBLAS_STATUS_NOT_INITIALIZED\n  current device: 0, in function cublas_handle at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda/common.cuh:526\n  cublasCreate_v2(&cublas_handles[device])\nGGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:60: !\"CUDA error\""
Apr 18 18:58:20 quorra ollama[1170]: [GIN] 2024/04/18 - 18:58:20 | 500 | 26.029470037s |      10.0.0.123 | POST     "/api/generate"
```
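To make the failure easier to attach here, the relevant window can be pulled straight out of the journal (timestamps taken from the log above; the `ollama` service name is assumed):

```sh
# Capture the full DBRX load attempt, including the cuBLAS assert
journalctl -u ollama --no-pager \
  --since "2024-04-18 18:57:50" --until "2024-04-18 18:58:25" > dbrx-cublas-failure.log
```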

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.32

quorra ollama[1170]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes Apr 18 19:49:56 quorra ollama[1170]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no Apr 18 19:49:56 quorra ollama[1170]: ggml_cuda_init: found 1 CUDA devices: Apr 18 19:49:56 quorra ollama[1170]: Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes Apr 18 19:49:56 quorra ollama[1170]: llm_load_tensors: ggml ctx size = 0.28 MiB Apr 18 19:49:56 quorra ollama[1170]: llm_load_tensors: offloading 40 repeating layers to GPU Apr 18 19:49:56 quorra ollama[1170]: llm_load_tensors: offloading non-repeating layers to GPU Apr 18 19:49:56 quorra ollama[1170]: llm_load_tensors: offloaded 41/41 layers to GPU Apr 18 19:49:56 quorra ollama[1170]: llm_load_tensors: CPU buffer size = 87.90 MiB Apr 18 19:49:56 quorra ollama[1170]: llm_load_tensors: CUDA0 buffer size = 6936.02 MiB Apr 18 19:49:56 quorra ollama[1170]: ................................................................................................... Apr 18 19:49:56 quorra ollama[1170]: llama_new_context_with_model: n_ctx = 4096 Apr 18 19:49:56 quorra ollama[1170]: llama_new_context_with_model: n_batch = 512 Apr 18 19:49:56 quorra ollama[1170]: llama_new_context_with_model: n_ubatch = 512 Apr 18 19:49:56 quorra ollama[1170]: llama_new_context_with_model: freq_base = 10000.0 Apr 18 19:49:56 quorra ollama[1170]: llama_new_context_with_model: freq_scale = 1 Apr 18 19:49:56 quorra ollama[1170]: llama_kv_cache_init: CUDA0 KV buffer size = 3200.00 MiB Apr 18 19:49:56 quorra ollama[1170]: llama_new_context_with_model: KV self size = 3200.00 MiB, K (f16): 1600.00 MiB, V (f16): 1600.00 MiB Apr 18 19:49:57 quorra ollama[1170]: llama_new_context_with_model: CUDA_Host output buffer size = 0.14 MiB Apr 18 19:49:57 quorra ollama[1170]: llama_new_context_with_model: CUDA0 compute buffer size = 368.00 MiB Apr 18 19:49:57 quorra ollama[1170]: llama_new_context_with_model: CUDA_Host compute buffer size = 18.01 MiB Apr 18 19:49:57 quorra ollama[1170]: llama_new_context_with_model: graph nodes = 1286 Apr 18 19:49:57 quorra ollama[1170]: llama_new_context_with_model: graph splits = 2 Apr 18 19:49:57 quorra ollama[583304]: {"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"140087397396480","timestamp":1713469797} Apr 18 19:49:57 quorra ollama[583304]: {"function":"initialize","level":"INFO","line":457,"msg":"new slot","n_ctx_slot":4096,"slot_id":0,"tid":"140087397396480","timestamp":1713469797} Apr 18 19:49:57 quorra ollama[583304]: {"function":"main","level":"INFO","line":3064,"msg":"model loaded","tid":"140087397396480","timestamp":1713469797} Apr 18 19:49:57 quorra ollama[583304]: {"function":"main","hostname":"127.0.0.1","level":"INFO","line":3267,"msg":"HTTP server listening","n_threads_http":"31","port":"37295","tid":"140087397396480","timestamp":1713469797} ``` I wonder if this is related. 
Here is an error I get when I attempt to load DBRX ``` Apr 18 18:57:54 quorra ollama[1170]: time=2024-04-18T18:57:54.713Z level=INFO source=routes.go:97 msg="changing loaded model" Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.028Z level=INFO source=gpu.go:121 msg="Detecting GPU type" Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.028Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*" Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.028Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1615772994/runners/cuda_v11/libcudart.so.11.0]" Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.039Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart" Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.039Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2" Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.128Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9" Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.141Z level=INFO source=gpu.go:121 msg="Detecting GPU type" Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.141Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*" Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.142Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama1615772994/runners/cuda_v11/libcudart.so.11.0]" Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.142Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart" Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.142Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2" Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.174Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9" Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.186Z level=INFO source=server.go:127 msg="offload to gpu" reallayers=8 layers=8 required="71518.7 MiB" used="14828.9 MiB" available="15857.2 MiB" kv="320.0 MiB" fulloffload="320.0 MiB" partialoffload="320.0 MiB" Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.186Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2" Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.187Z level=INFO source=server.go:264 msg="starting llama server" cmd="/tmp/ollama1615772994/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-1d12441f19436dbb0bcc4067e9d47921b944ef4a87b35873aa430e85e91a93c8 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 8 --port 41305" Apr 18 18:57:55 quorra ollama[1170]: time=2024-04-18T18:57:55.187Z level=INFO source=server.go:389 msg="waiting for llama runner to start responding" Apr 18 18:57:55 quorra ollama[20191]: {"function":"server_params_parse","level":"INFO","line":2603,"msg":"logging to file is disabled.","tid":"140044021682176","timestamp":1713466675} Apr 18 18:57:55 quorra ollama[20191]: {"build":1,"commit":"7593639","function":"main","level":"INFO","line":2819,"msg":"build info","tid":"140044021682176","timestamp":1713466675} Apr 18 18:57:55 quorra ollama[20191]: {"function":"main","level":"INFO","line":2822,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 
1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"140044021682176","timestamp":1713466675,"total_threads":32} Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: loaded meta data with 24 key-value pairs and 323 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-1d12441f19436dbb0bcc4067e9d47921b944ef4a87b35873aa430e85e91a93c8 (version GGUF V3 (latest)) Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 0: general.architecture str = dbrx Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 1: general.name str = dbrx Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 2: dbrx.block_count u32 = 40 Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 3: dbrx.context_length u32 = 32768 Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 4: dbrx.embedding_length u32 = 6144 Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 5: dbrx.feed_forward_length u32 = 10752 Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 6: dbrx.attention.head_count u32 = 48 Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 7: dbrx.attention.head_count_kv u32 = 8 Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 8: dbrx.rope.freq_base f32 = 500000.000000 Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 9: dbrx.attention.clamp_kqv f32 = 8.000000 Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 10: general.file_type u32 = 2 Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 11: dbrx.expert_count u32 = 16 Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 12: dbrx.expert_used_count u32 = 4 Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 13: dbrx.attention.layer_norm_epsilon f32 = 0.000010 Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2 Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ... Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 100257 Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 100257 Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 20: tokenizer.ggml.unknown_token_id u32 = 100257 Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 100277 Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 22: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'... Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - kv 23: general.quantization_version u32 = 2 Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - type f32: 81 tensors Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - type f16: 40 tensors Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - type q4_0: 201 tensors Apr 18 18:57:55 quorra ollama[1170]: llama_model_loader: - type q6_K: 1 tensors Apr 18 18:57:55 quorra ollama[1170]: llm_load_vocab: special tokens definition check successful ( 96/100352 ). 
Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: format = GGUF V3 (latest) Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: arch = dbrx Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: vocab type = BPE Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_vocab = 100352 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_merges = 100000 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_ctx_train = 32768 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_embd = 6144 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_head = 48 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_head_kv = 8 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_layer = 40 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_rot = 128 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_embd_head_k = 128 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_embd_head_v = 128 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_gqa = 6 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_embd_k_gqa = 1024 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_embd_v_gqa = 1024 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: f_norm_eps = 1.0e-05 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: f_norm_rms_eps = 0.0e+00 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: f_clamp_kqv = 8.0e+00 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: f_logit_scale = 0.0e+00 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_ff = 10752 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_expert = 16 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_expert_used = 4 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: causal attn = 1 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: pooling type = 0 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: rope type = 2 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: rope scaling = linear Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: freq_base_train = 500000.0 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: freq_scale_train = 1 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: n_yarn_orig_ctx = 32768 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: rope_finetuned = unknown Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: ssm_d_conv = 0 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: ssm_d_inner = 0 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: ssm_d_state = 0 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: ssm_dt_rank = 0 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: model type = 16x12B Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: model ftype = Q4_0 Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: model params = 131.60 B Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: model size = 69.09 GiB (4.51 BPW) Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: general.name = dbrx Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: BOS token = 100257 '<|endoftext|>' Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: EOS token = 100257 '<|endoftext|>' Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: UNK token = 100257 '<|endoftext|>' Apr 18 18:57:55 quorra ollama[1170]: llm_load_print_meta: PAD token = 100277 '<|pad|>' Apr 18 18:57:55 quorra 
ollama[1170]: llm_load_print_meta: LF token = 128 'Ä'
Apr 18 18:57:55 quorra ollama[1170]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
Apr 18 18:57:55 quorra ollama[1170]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
Apr 18 18:57:55 quorra ollama[1170]: ggml_cuda_init: found 1 CUDA devices:
Apr 18 18:57:55 quorra ollama[1170]:   Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
Apr 18 18:57:55 quorra ollama[1170]: llm_load_tensors: ggml ctx size = 0.74 MiB
Apr 18 18:58:18 quorra ollama[1170]: llm_load_tensors: offloading 8 repeating layers to GPU
Apr 18 18:58:18 quorra ollama[1170]: llm_load_tensors: offloaded 8/41 layers to GPU
Apr 18 18:58:18 quorra ollama[1170]: llm_load_tensors: CPU buffer size = 70752.49 MiB
Apr 18 18:58:18 quorra ollama[1170]: llm_load_tensors: CUDA0 buffer size = 13987.88 MiB
Apr 18 18:58:19 quorra ollama[1170]: ....................................................................................................
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: n_ctx = 2048
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: n_batch = 512
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: n_ubatch = 512
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: freq_base = 500000.0
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: freq_scale = 1
Apr 18 18:58:19 quorra ollama[1170]: llama_kv_cache_init: CUDA_Host KV buffer size = 256.00 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_kv_cache_init: CUDA0 KV buffer size = 64.00 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: KV self size = 320.00 MiB, K (f16): 160.00 MiB, V (f16): 160.00 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: CUDA_Host output buffer size = 0.41 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: CUDA0 compute buffer size = 1794.00 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: CUDA_Host compute buffer size = 16.01 MiB
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: graph nodes = 2886
Apr 18 18:58:19 quorra ollama[1170]: llama_new_context_with_model: graph splits = 325
Apr 18 18:58:20 quorra ollama[1170]: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED
Apr 18 18:58:20 quorra ollama[1170]:   current device: 0, in function cublas_handle at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda/common.cuh:526
Apr 18 18:58:20 quorra ollama[1170]:   cublasCreate_v2(&cublas_handles[device])
Apr 18 18:58:20 quorra ollama[1170]: GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:60: !"CUDA error"
Apr 18 18:58:20 quorra ollama[1170]: time=2024-04-18T18:58:20.740Z level=ERROR source=routes.go:120 msg="error loading llama server" error="llama runner process no longer running: -1 CUDA error: CUBLAS_STATUS_NOT_INITIALIZED\n  current device: 0, in function cublas_handle at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda/common.cuh:526\n  cublasCreate_v2(&cublas_handles[device])\nGGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:60: !\"CUDA error\""
Apr 18 18:58:20 quorra ollama[1170]: [GIN] 2024/04/18 - 18:58:20 | 500 | 26.029470037s | 10.0.0.123 | POST "/api/generate"
```

### OS

Linux

### GPU

Nvidia

### CPU

Intel

### Ollama version

0.1.32
GiteaMirror added the bug and nvidia labels 2026-04-28 09:25:04 -05:00

@taozhiyuai commented on GitHub (Apr 18, 2024):

This happens when there is not enough V-RAM to run it on GPU+V-RAM, so ollama runs it on CPU+HD.

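The decisive record of where a load actually went is the scheduler's `msg="offload to gpu"` line in the server journal: `layers=0` means the runner was started on the CPU, while a non-zero count means layers went to the CUDA runner. A minimal sketch for scanning a saved journal dump follows; the log file path and the idea of exporting `journalctl -u ollama` output to a file are assumptions, not something from this thread:

```python
import re
import sys

# Matches the scheduler decision line, e.g.:
#   msg="offload to gpu" reallayers=0 layers=0 required="10961.0 MiB" ... available="270.6 MiB"
OFFLOAD = re.compile(
    r'msg="offload to gpu" reallayers=\d+ layers=(\d+) '
    r'required="([\d.]+) MiB".*?available="([\d.]+) MiB"'
)

def report(log_path):
    """Print one line per model load, saying whether it went to the GPU or the CPU."""
    with open(log_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = OFFLOAD.search(line)
            if not m:
                continue
            layers = int(m.group(1))
            required, available = float(m.group(2)), float(m.group(3))
            where = "GPU" if layers > 0 else "CPU"
            print(f"{where}: layers={layers} required={required} MiB available={available} MiB")

if __name__ == "__main__":
    report(sys.argv[1])  # e.g. a file produced by: journalctl -u ollama > ollama.log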

@eng1n88r commented on GitHub (Apr 18, 2024):

Is it possible to utilize RAM + VRAM?
![image](https://github.com/ollama/ollama/assets/883804/0b8a9ae6-f1bd-410d-9dd2-256506120737)

I'm trying to run a ~40GB model locally on a 4090 (24GB), and I have 128GB of RAM, of which almost 64GB is dedicated to GPU usage (based on the screenshot above). Sure, RAM is slower than dedicated GPU memory, but would it make sense to use it instead of the CPU?

Currently the ~40GB model loads the CPU at ~95-100%, while GPU utilization is around 20%.


@taozhiyuai commented on GitHub (Apr 19, 2024):

> Is it possible to utilize RAM + VRAM? image
>
> I'm trying to run a ~40GB model locally on a 4090 (24GB), and I have 128GB of RAM, of which almost 64GB is dedicated to GPU usage (based on the screenshot above). Sure, RAM is slower than dedicated GPU memory, but would it make sense to use it instead of the CPU?
>
> Currently the ~40GB model loads the CPU at ~95-100%, while GPU utilization is around 20%.

I think it is working in CPU+HD mode.


@eng1n88r commented on GitHub (Apr 19, 2024):

> I think it is working in CPU+HD mode.

I think I see that, based on the utilization percentage. To rephrase my question: is it possible to fully utilize the GPU and use [Unified Memory](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-memory-programming) to reduce CPU usage?


@debug-deng commented on GitHub (Apr 19, 2024):

> > I think it is working in CPU+HD mode.
>
> I think I see that, based on the utilization percentage. To rephrase my question: is it possible to fully utilize the GPU and use [Unified Memory](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-memory-programming) to reduce CPU usage?

Hello, I am encountering a similar issue where my GPU utilization is only at 20% while my RAM consumption has reached 90%. I know that setting `num_gpu` can address this, but it requires modifying the model.
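
For what it's worth, `num_gpu` is also accepted as a per-request option on the REST API, so it shouldn't strictly require editing the Modelfile. A rough sketch; the model name, layer count, and host below are placeholders, not values from this thread:

```python
import json
import urllib.request

# Ask the server to offload a specific number of layers for this request only.
payload = {
    "model": "llama2:13b-chat",       # placeholder; any installed model
    "prompt": "Why is the sky blue?",
    "stream": False,
    "options": {"num_gpu": 41},       # layers to push to the GPU; pick what fits your VRAM
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"][:200])
```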


@taozhiyuai commented on GitHub (Apr 19, 2024):

> Is it possible to utilize RAM + VRAM? image
>
> I'm trying to run a ~40GB model locally on a 4090 (24GB), and I have 128GB of RAM, of which almost 64GB is dedicated to GPU usage (based on the screenshot above). Sure, RAM is slower than dedicated GPU memory, but would it make sense to use it instead of the CPU?
>
> Currently the ~40GB model loads the CPU at ~95-100%, while GPU utilization is around 20%.

My Mac is an M3 Max with 128GB. Command R Plus Q8 uses around 101GB on the GPU via LM Studio.


@MarkWard0110 commented on GitHub (Apr 19, 2024):

It has happened again. This never happened with previous versions of Ollama. I have a program that fetches the list of available models, sorts the list randomly, executes 40 prompts against each model, and records the time and TPS. These are the same models I have been running against previous versions of Ollama. This version, v0.1.32, sometimes appears to not find the GPU and loads a model onto the CPU.
The only way I have reproduced this is by running the program; I have not been able to reproduce it by just running `ollama run`.
I have ruled out the new models. I have removed them, and I still get runs where a model that should have been on the GPU ends up on the CPU.
I saved my previous runs and can compare the TPS between the previous versions and this version for each of the models.
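
For context, a harness of the kind described above boils down to a loop against the REST API: list the installed models, shuffle, then time each generation. The following is a simplified sketch of that shape, not the actual program used here; the prompt set and endpoint are placeholders:

```python
import json
import random
import time
import urllib.request

BASE = "http://localhost:11434"  # default Ollama endpoint; adjust for your host

def api(path, payload=None):
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(BASE + path, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# List installed models and visit them in random order, as in the benchmark runs.
models = [m["name"] for m in api("/api/tags")["models"]]
random.shuffle(models)

prompts = ["Why is the sky blue?"] * 40  # stand-in for the real 40-prompt set

for model in models:
    for prompt in prompts:
        start = time.time()
        r = api("/api/generate", {"model": model, "prompt": prompt, "stream": False})
        tps = r["eval_count"] / r["eval_duration"] * 1e9  # eval_duration is in nanoseconds
        print(f"{model}\t{time.time() - start:.1f}s\t{tps:.1f} tok/s")
```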

The list of models I am running right now is:

```
codellama:13b-instruct          9f438cb9cd58    7.4 GB  20 hours ago
codellama:34b-instruct          685be00e1532    19 GB   20 hours ago
codellama:70b-instruct          e59b580dfce7    38 GB   20 hours ago
codellama:7b-instruct           8fdf8f752f6e    3.8 GB  20 hours ago
command-r:35b                   b8cdfff0263c    20 GB   20 hours ago
deepseek-coder:1.3b-instruct    3ddd2d3fc8d2    776 MB  20 hours ago
deepseek-coder:33b-instruct     acec7c0b0fd9    18 GB   20 hours ago
deepseek-coder:6.7b-instruct    ce298d984115    3.8 GB  20 hours ago
gemma:2b-instruct               030ee63283b5    1.6 GB  20 hours ago
gemma:7b-instruct               a72c7f4d0a15    5.0 GB  20 hours ago
llama2:13b-chat                 d475bf4c50bc    7.4 GB  20 hours ago
llama2:70b-chat                 e7f6c06ffef4    38 GB   20 hours ago
llama2:7b-chat                  78e26419b446    3.8 GB  20 hours ago
mistral:7b                      61e88e884507    4.1 GB  20 hours ago
mixtral:8x7b                    7708c059a8bb    26 GB   20 hours ago
neural-chat:7b                  89fa737d3b85    4.1 GB  20 hours ago
orca-mini:13b                   1b4877c90807    7.4 GB  20 hours ago
orca-mini:3b                    2dbd9f439647    2.0 GB  20 hours ago
orca-mini:70b                   f184c0860491    38 GB   20 hours ago
orca-mini:7b                    9c9618e2e895    3.8 GB  20 hours ago
orca2:13b                       a8dcfac3ac32    7.4 GB  20 hours ago
orca2:7b                        ea98cc422de3    3.8 GB  20 hours ago
phi:2.7b                        e2fd6321a5fe    1.6 GB  20 hours ago
qwen:14b                        80362ced6553    8.2 GB  20 hours ago
qwen:32b                        26e7e8447f5d    18 GB   20 hours ago
qwen:4b                         d53d04290064    2.3 GB  20 hours ago
qwen:72b                        e1c64582de5c    41 GB   20 hours ago
qwen:7b                         2091ee8c8d8f    4.5 GB  20 hours ago
wizardlm2:7b                    c9b1aff820f2    4.1 GB  20 hours ago
```

Here is another model that went to the CPU when it should have been on the GPU. My GPU has 16GB of VRAM. I am running Ubuntu Server and the GPU is not used for video output; the onboard graphics handles video out.

```
Apr 19 12:05:22 quorra ollama[1180]: time=2024-04-19T12:05:22.897Z level=INFO source=routes.go:97 msg="changing loaded model"
Apr 19 12:05:23 quorra ollama[1180]: time=2024-04-19T12:05:23.215Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
Apr 19 12:05:23 quorra ollama[1180]: time=2024-04-19T12:05:23.216Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
Apr 19 12:05:23 quorra ollama[1180]: time=2024-04-19T12:05:23.216Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama799514660/runners/cuda_v11/libcudart.so.11.0]"
Apr 19 12:05:23 quorra ollama[1180]: time=2024-04-19T12:05:23.217Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
Apr 19 12:05:23 quorra ollama[1180]: time=2024-04-19T12:05:23.217Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 19 12:05:23 quorra ollama[1180]: time=2024-04-19T12:05:23.244Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
Apr 19 12:05:23 quorra ollama[1180]: time=2024-04-19T12:05:23.255Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
Apr 19 12:05:23 quorra ollama[1180]: time=2024-04-19T12:05:23.255Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
Apr 19 12:05:23 quorra ollama[1180]: time=2024-04-19T12:05:23.255Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama799514660/runners/cuda_v11/libcudart.so.11.0]"
Apr 19 12:05:23 quorra ollama[1180]: time=2024-04-19T12:05:23.255Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
Apr 19 12:05:23 quorra ollama[1180]: time=2024-04-19T12:05:23.255Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 19 12:05:23 quorra ollama[1180]: time=2024-04-19T12:05:23.276Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
Apr 19 12:05:23 quorra ollama[1180]: time=2024-04-19T12:05:23.287Z level=INFO source=server.go:127 msg="offload to gpu" reallayers=0 layers=0 required="5749.1 MiB" used="1248.6 MiB" available="226.6 MiB" kv="1024.0 MiB" fulloffload="304.8 MiB" partialoffload="791.6 MiB"
Apr 19 12:05:23 quorra ollama[1180]: time=2024-04-19T12:05:23.287Z level=INFO source=server.go:264 msg="starting llama server" cmd="/tmp/ollama799514660/runners/cpu/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-87f26aae09c7f052de93ff98a2282f05822cc6de4af1a2a159c5bd1acbd10ec4 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 0 --port 34181"
Apr 19 12:05:23 quorra ollama[1180]: time=2024-04-19T12:05:23.287Z level=INFO source=server.go:389 msg="waiting for llama runner to start responding"
Apr 19 12:05:23 quorra ollama[2231250]: {"function":"server_params_parse","level":"INFO","line":2603,"msg":"logging to file is disabled.","tid":"139995892541312","timestamp":1713528323}
Apr 19 12:05:23 quorra ollama[2231250]: {"function":"server_params_parse","level":"WARN","line":2380,"msg":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1,"tid":"139995892541312","timestamp":1713528323}
Apr 19 12:05:23 quorra ollama[2231250]: {"build":1,"commit":"7593639","function":"main","level":"INFO","line":2819,"msg":"build info","tid":"139995892541312","timestamp":1713528323}
Apr 19 12:05:23 quorra ollama[2231250]: {"function":"main","level":"INFO","line":2822,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"139995892541312","timestamp":1713528323,"total_threads":32}
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: loaded meta data with 20 key-value pairs and 387 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-87f26aae09c7f052de93ff98a2282f05822cc6de4af1a2a159c5bd1acbd10ec4 (version GGUF V3 (latest))
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - kv   0:                       general.architecture str              = qwen2
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - kv   1:                               general.name str              = Qwen2-beta-7B-Chat
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - kv   2:                          qwen2.block_count u32              = 32
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 4096
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 11008
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 32
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 32
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - kv   8:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - kv   9:                qwen2.use_parallel_residual bool             = true
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - kv  10:                       tokenizer.ggml.model str              = gpt2
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - kv  11:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - kv  12:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - kv  13:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 151643
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - kv  15:            tokenizer.ggml.padding_token_id u32              = 151643
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 151643
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - kv  17:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - kv  18:               general.quantization_version u32              = 2
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - kv  19:                          general.file_type u32              = 2
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - type  f32:  161 tensors
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - type q4_0:  225 tensors
Apr 19 12:05:23 quorra ollama[1180]: llama_model_loader: - type q6_K:    1 tensors
Apr 19 12:05:23 quorra ollama[1180]: llm_load_vocab: special tokens definition check successful ( 293/151936 ).
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: format           = GGUF V3 (latest)
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: arch             = qwen2
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: vocab type       = BPE
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: n_vocab          = 151936
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: n_merges         = 151387
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: n_ctx_train      = 32768
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: n_embd           = 4096
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: n_head           = 32
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: n_head_kv        = 32
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: n_layer          = 32
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: n_rot            = 128
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: n_embd_head_k    = 128
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: n_embd_head_v    = 128
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: n_gqa            = 1
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: n_embd_k_gqa     = 4096
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: n_embd_v_gqa     = 4096
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: n_ff             = 11008
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: n_expert         = 0
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: n_expert_used    = 0
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: causal attn      = 1
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: pooling type     = 0
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: rope type        = 2
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: rope scaling     = linear
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: freq_base_train  = 10000.0
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: freq_scale_train = 1
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: n_yarn_orig_ctx  = 32768
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: rope_finetuned   = unknown
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: ssm_d_conv       = 0
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: ssm_d_inner      = 0
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: ssm_d_state      = 0
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: ssm_dt_rank      = 0
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: model type       = 7B
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: model ftype      = Q4_0
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: model params     = 7.72 B
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: model size       = 4.20 GiB (4.67 BPW)
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: general.name     = Qwen2-beta-7B-Chat
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: EOS token        = 151643 '<|endoftext|>'
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
Apr 19 12:05:23 quorra ollama[1180]: llm_load_print_meta: LF token         = 148848 'ÄĬ'
Apr 19 12:05:23 quorra ollama[1180]: llm_load_tensors: ggml ctx size =    0.15 MiB
Apr 19 12:05:24 quorra ollama[1180]: llm_load_tensors:        CPU buffer size =  4297.21 MiB
Apr 19 12:05:24 quorra ollama[1180]: ...................................................................................
Apr 19 12:05:24 quorra ollama[1180]: llama_new_context_with_model: n_ctx      = 2048
Apr 19 12:05:24 quorra ollama[1180]: llama_new_context_with_model: n_batch    = 512
Apr 19 12:05:24 quorra ollama[1180]: llama_new_context_with_model: n_ubatch   = 512
Apr 19 12:05:24 quorra ollama[1180]: llama_new_context_with_model: freq_base  = 10000.0
Apr 19 12:05:24 quorra ollama[1180]: llama_new_context_with_model: freq_scale = 1
Apr 19 12:05:25 quorra ollama[1180]: llama_kv_cache_init:        CPU KV buffer size =  1024.00 MiB
Apr 19 12:05:25 quorra ollama[1180]: llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
Apr 19 12:05:25 quorra ollama[1180]: llama_new_context_with_model:        CPU  output buffer size =     0.60 MiB
Apr 19 12:05:25 quorra ollama[1180]: llama_new_context_with_model:        CPU compute buffer size =   304.75 MiB
Apr 19 12:05:25 quorra ollama[1180]: llama_new_context_with_model: graph nodes  = 1126
Apr 19 12:05:25 quorra ollama[1180]: llama_new_context_with_model: graph splits = 1
Apr 19 12:05:25 quorra ollama[2231250]: {"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"139995892541312","timestamp":1713528325}
Apr 19 12:05:25 quorra ollama[2231250]: {"function":"initialize","level":"INFO","line":457,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"139995892541312","timestamp":1713528325}
Apr 19 12:05:25 quorra ollama[2231250]: {"function":"main","level":"INFO","line":3064,"msg":"model loaded","tid":"139995892541312","timestamp":1713528325}
Apr 19 12:05:25 quorra ollama[2231250]: {"function":"main","hostname":"127.0.0.1","level":"INFO","line":3267,"msg":"HTTP server listening","n_threads_http":"31","port":"34181","tid":"139995892541312","timestamp":1713528325}
Apr 19 12:05:25 quorra ollama[2231250]: {"function":"update_slots","level":"INFO","line":1578,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"139995892541312","timestamp":1713528325}
```
Author
Owner

@MarkWard0110 commented on GitHub (Apr 19, 2024):

@taozhiyuai

> happen when v-ram is not enough to run on GPU+V-RAM, so ollama runs it on CPU+HD

Ollama v0.1.32 is sometimes running models on the CPU that have always run on the GPU. I am benchmarking, and runs under v0.1.32 randomly load models on the CPU that have always loaded on the GPU.
Also, note how the logs differ when the same model is loaded on the CPU versus the GPU.

Apr 19 12:05:23 quorra ollama[1180]: time=2024-04-19T12:05:23.287Z level=INFO source=server.go:127 msg="offload to gpu" reallayers=0 layers=0 required="5749.1 MiB" used="1248.6 MiB" available="226.6 MiB" kv="1024.0 MiB" fulloffload="304.8 MiB" partialoffload="791.6 MiB"


Apr 19 12:05:23 quorra ollama[1180]: llm_load_tensors: ggml ctx size =    0.15 MiB
Apr 19 12:05:24 quorra ollama[1180]: llm_load_tensors:        CPU buffer size =  4297.21 MiB
Apr 19 12:05:24 quorra ollama[1180]: ...................................................................................
Apr 19 12:05:24 quorra ollama[1180]: llama_new_context_with_model: n_ctx      = 2048
Apr 19 12:05:24 quorra ollama[1180]: llama_new_context_with_model: n_batch    = 512
Apr 19 12:05:24 quorra ollama[1180]: llama_new_context_with_model: n_ubatch   = 512
Apr 19 12:05:24 quorra ollama[1180]: llama_new_context_with_model: freq_base  = 10000.0
Apr 19 12:05:24 quorra ollama[1180]: llama_new_context_with_model: freq_scale = 1
Apr 19 12:05:25 quorra ollama[1180]: llama_kv_cache_init:        CPU KV buffer size =  1024.00 MiB
Apr 19 12:05:25 quorra ollama[1180]: llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
Apr 19 12:05:25 quorra ollama[1180]: llama_new_context_with_model:        CPU  output buffer size =     0.60 MiB
Apr 19 12:05:25 quorra ollama[1180]: llama_new_context_with_model:        CPU compute buffer size =   304.75 MiB
Apr 19 12:05:25 quorra ollama[1180]: llama_new_context_with_model: graph nodes  = 1126
Apr 19 12:05:25 quorra ollama[1180]: llama_new_context_with_model: graph splits = 1
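
The "offload to gpu" line above is the tell: layers=0 even though the model should fit, and available="226.6 MiB" reported for a 16 GB card that is otherwise idle. As a rough way to catch these placements while benchmarking (a minimal sketch, assuming Ollama runs as the systemd unit "ollama", as the journal prefix above suggests), one can watch the journal for that line:

```
# Minimal sketch: follow the Ollama journal and flag loads where no layers
# were offloaded. Assumes the service is the systemd unit "ollama".
journalctl -u ollama -f --no-pager \
  | grep --line-buffered 'offload to gpu' \
  | grep --line-buffered 'layers=0'
```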

It completely ignored the GPU. This does not happen every time. The following is another load of the model where it did use the GPU.

Apr 19 13:18:44 quorra ollama[1180]: time=2024-04-19T13:18:44.276Z level=INFO source=server.go:127 msg="offload to gpu" reallayers=33 layers=33 required="5749.1 MiB" used="5749.1 MiB" available="15857.2 MiB" kv="1024.0 MiB" fulloffload="304.8 MiB" partialoffload="791.6 MiB"

Apr 19 13:18:44 quorra ollama[1180]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
Apr 19 13:18:44 quorra ollama[1180]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
Apr 19 13:18:44 quorra ollama[1180]: ggml_cuda_init: found 1 CUDA devices:
Apr 19 13:18:44 quorra ollama[1180]:   Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
Apr 19 13:18:44 quorra ollama[1180]: llm_load_tensors: ggml ctx size =    0.30 MiB
Apr 19 13:18:44 quorra ollama[1180]: llm_load_tensors: offloading 32 repeating layers to GPU
Apr 19 13:18:44 quorra ollama[1180]: llm_load_tensors: offloading non-repeating layers to GPU
Apr 19 13:18:44 quorra ollama[1180]: llm_load_tensors: offloaded 33/33 layers to GPU
Apr 19 13:18:44 quorra ollama[1180]: llm_load_tensors:        CPU buffer size =   333.84 MiB
Apr 19 13:18:44 quorra ollama[1180]: llm_load_tensors:      CUDA0 buffer size =  3963.38 MiB
Apr 19 13:18:44 quorra ollama[1180]: ...................................................................................
Apr 19 13:18:44 quorra ollama[1180]: llama_new_context_with_model: n_ctx      = 2048
Apr 19 13:18:44 quorra ollama[1180]: llama_new_context_with_model: n_batch    = 512
Apr 19 13:18:44 quorra ollama[1180]: llama_new_context_with_model: n_ubatch   = 512
Apr 19 13:18:44 quorra ollama[1180]: llama_new_context_with_model: freq_base  = 10000.0
Apr 19 13:18:44 quorra ollama[1180]: llama_new_context_with_model: freq_scale = 1
Apr 19 13:18:44 quorra ollama[1180]: llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
Apr 19 13:18:44 quorra ollama[1180]: llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
Apr 19 13:18:44 quorra ollama[1180]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.60 MiB
Apr 19 13:18:44 quorra ollama[1180]: llama_new_context_with_model:      CUDA0 compute buffer size =   304.75 MiB
Apr 19 13:18:44 quorra ollama[1180]: llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
Apr 19 13:18:44 quorra ollama[1180]: llama_new_context_with_model: graph nodes  = 1126
Apr 19 13:18:44 quorra ollama[1180]: llama_new_context_with_model: graph splits = 2
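
The only substantive difference between this load and the earlier one is the free VRAM Ollama computed: available="226.6 MiB" in the CPU case versus available="15857.2 MiB" here, on the same otherwise idle card. A quick cross-check of what the driver itself reports at load time (a sketch, assuming the 4070 Ti SUPER is device 0) would be:

```
# Snapshot the driver's view of VRAM just before triggering a load, to compare
# against the "available=" value in Ollama's "offload to gpu" log line.
nvidia-smi --id=0 --query-gpu=memory.total,memory.used,memory.free --format=csv
```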

When a model is too big for the GPU, Ollama splits it between the GPU and CPU, as in the following.

Apr 19 12:42:37 quorra ollama[1180]: time=2024-04-19T12:42:37.364Z level=INFO source=server.go:127 msg="offload to gpu" reallayers=39 layers=39 required="19193.1 MiB" used="15598.2 MiB" available="15857.2 MiB" kv="384.0 MiB" fulloffload="324.0 MiB" partialoffload="348.0 MiB"

Apr 19 12:42:37 quorra ollama[1180]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
Apr 19 12:42:37 quorra ollama[1180]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
Apr 19 12:42:37 quorra ollama[1180]: ggml_cuda_init: found 1 CUDA devices:
Apr 19 12:42:37 quorra ollama[1180]:   Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
Apr 19 12:42:37 quorra ollama[1180]: llm_load_tensors: ggml ctx size =    0.33 MiB
Apr 19 12:42:43 quorra ollama[1180]: llm_load_tensors: offloading 39 repeating layers to GPU
Apr 19 12:42:43 quorra ollama[1180]: llm_load_tensors: offloaded 39/49 layers to GPU
Apr 19 12:42:43 quorra ollama[1180]: llm_load_tensors:        CPU buffer size = 18168.73 MiB
Apr 19 12:42:43 quorra ollama[1180]: llm_load_tensors:      CUDA0 buffer size = 14481.19 MiB
Apr 19 12:42:45 quorra ollama[1180]: ....................................................................................................
Apr 19 12:42:45 quorra ollama[1180]: llama_new_context_with_model: n_ctx      = 2048
Apr 19 12:42:45 quorra ollama[1180]: llama_new_context_with_model: n_batch    = 512
Apr 19 12:42:45 quorra ollama[1180]: llama_new_context_with_model: n_ubatch   = 512
Apr 19 12:42:45 quorra ollama[1180]: llama_new_context_with_model: freq_base  = 1000000.0
Apr 19 12:42:45 quorra ollama[1180]: llama_new_context_with_model: freq_scale = 1
Apr 19 12:42:45 quorra ollama[1180]: llama_kv_cache_init:  CUDA_Host KV buffer size =    72.00 MiB
Apr 19 12:42:45 quorra ollama[1180]: llama_kv_cache_init:      CUDA0 KV buffer size =   312.00 MiB
Apr 19 12:42:45 quorra ollama[1180]: llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
Apr 19 12:42:45 quorra ollama[1180]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.15 MiB
Apr 19 12:42:45 quorra ollama[1180]: llama_new_context_with_model:      CUDA0 compute buffer size =   340.00 MiB
Apr 19 12:42:45 quorra ollama[1180]: llama_new_context_with_model:  CUDA_Host compute buffer size =    20.01 MiB
Apr 19 12:42:45 quorra ollama[1180]: llama_new_context_with_model: graph nodes  = 1542
Apr 19 12:42:45 quorra ollama[1180]: llama_new_context_with_model: graph splits = 103
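
Roughly speaking, the number of layers that get offloaded in the partial case is the free VRAM minus the fixed KV-cache and compute-buffer overheads, divided by the average per-layer weight size. The sketch below only illustrates that arithmetic; the sizes are hypothetical and it is not Ollama's actual estimator.

```
# Back-of-the-envelope layer fit (illustrative only, NOT Ollama's estimator).
free_mib=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n1)
weights_mib=18600   # hypothetical total weight size in MiB
layers=49           # hypothetical layer count
kv_mib=384          # hypothetical KV cache size in MiB
overhead_mib=350    # hypothetical graph/compute overhead in MiB
per_layer_mib=$(( weights_mib / layers ))
echo "roughly $(( (free_mib - kv_mib - overhead_mib) / per_layer_mib ))/${layers} layers fit on the GPU"
```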

Author
Owner

@MarkWard0110 commented on GitHub (Apr 19, 2024):

On another run, the model "deepseek-coder:6.7b-instruct" is running on the CPU when it otherwise runs on the GPU.


Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.043Z level=INFO source=routes.go:97 msg="changing loaded model"
Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.138Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.138Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.139Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama799514660/runners/cuda_v11/libcudart.so.11.0]"
Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.147Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.147Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.173Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.184Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.184Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.185Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama799514660/runners/cuda_v11/libcudart.so.11.0]"
Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.185Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.185Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.209Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.220Z level=INFO source=server.go:127 msg="offload to gpu" reallayers=0 layers=0 required="5223.4 MiB" used="650.0 MiB" available="270.6 MiB" kv="1024.0 MiB" fulloffload="164.0 MiB" partialoffload="193.0 MiB"
Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.220Z level=INFO source=server.go:264 msg="starting llama server" cmd="/tmp/ollama799514660/runners/cpu/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-59bb50d8116b6a1f9bfbb940d6bb946a05554e591e30c8c2429ed6c854867ecb --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 0 --port 34059"
Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.220Z level=INFO source=server.go:389 msg="waiting for llama runner to start responding"
Apr 19 15:43:19 quorra ollama[3559137]: {"function":"server_params_parse","level":"INFO","line":2603,"msg":"logging to file is disabled.","tid":"140322160809856","timestamp":1713541399}
Apr 19 15:43:19 quorra ollama[3559137]: {"function":"server_params_parse","level":"WARN","line":2380,"msg":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1,"tid":"140322160809856","timestamp":1713541399}
Apr 19 15:43:19 quorra ollama[3559137]: {"build":1,"commit":"7593639","function":"main","level":"INFO","line":2819,"msg":"build info","tid":"140322160809856","timestamp":1713541399}
Apr 19 15:43:19 quorra ollama[3559137]: {"function":"main","level":"INFO","line":2822,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"140322160809856","timestamp":1713541399,"total_threads":32}
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-59bb50d8116b6a1f9bfbb940d6bb946a05554e591e30c8c2429ed6c854867ecb (version GGUF V3 (latest))
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv   1:                               general.name str              = deepseek-ai
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv   4:                          llama.block_count u32              = 32
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 100000.000000
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv  11:                    llama.rope.scaling.type str              = linear
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv  12:                  llama.rope.scaling.factor f32              = 4.000000
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv  13:                          general.file_type u32              = 2
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32256]   = ["!", "\"", "#", "$", "%", "&", "'", ...
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32256]   = [0.000000, 0.000000, 0.000000, 0.0000...
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32256]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,31757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 32013
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 32021
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 32014
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv  25:               general.quantization_version u32              = 2
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - type  f32:   65 tensors
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - type q4_0:  225 tensors
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - type q6_K:    1 tensors
Apr 19 15:43:19 quorra ollama[1180]: llm_load_vocab: mismatch in special tokens definition ( 243/32256 vs 256/32256 ).
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: format           = GGUF V3 (latest)
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: arch             = llama
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: vocab type       = BPE
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_vocab          = 32256
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_merges         = 31757
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_ctx_train      = 16384
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_embd           = 4096
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_head           = 32
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_head_kv        = 32
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_layer          = 32
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_rot            = 128
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_embd_head_k    = 128
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_embd_head_v    = 128
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_gqa            = 1
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_embd_k_gqa     = 4096
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_embd_v_gqa     = 4096
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_ff             = 11008
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_expert         = 0
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_expert_used    = 0
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: causal attn      = 1
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: pooling type     = 0
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: rope type        = 0
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: rope scaling     = linear
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: freq_base_train  = 100000.0
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: freq_scale_train = 0.25
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_yarn_orig_ctx  = 16384
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: rope_finetuned   = unknown
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: ssm_d_conv       = 0
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: ssm_d_inner      = 0
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: ssm_d_state      = 0
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: ssm_dt_rank      = 0
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: model type       = 7B
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: model ftype      = Q4_0
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: model params     = 6.74 B
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: general.name     = deepseek-ai
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: BOS token        = 32013 '<|begin▁of▁sentence|>'
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: EOS token        = 32021 '<|EOT|>'
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: PAD token        = 32014 '<|end▁of▁sentence|>'
Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: LF token         = 126 'Ä'
Apr 19 15:43:19 quorra ollama[1180]: llm_load_tensors: ggml ctx size =    0.11 MiB
Apr 19 15:43:19 quorra ollama[1180]: llm_load_tensors:        CPU buffer size =  3649.25 MiB
Apr 19 15:43:19 quorra ollama[1180]: ..................................................................................................
Apr 19 15:43:19 quorra ollama[1180]: llama_new_context_with_model: n_ctx      = 2048
Apr 19 15:43:19 quorra ollama[1180]: llama_new_context_with_model: n_batch    = 512
Apr 19 15:43:19 quorra ollama[1180]: llama_new_context_with_model: n_ubatch   = 512
Apr 19 15:43:19 quorra ollama[1180]: llama_new_context_with_model: freq_base  = 100000.0
Apr 19 15:43:19 quorra ollama[1180]: llama_new_context_with_model: freq_scale = 0.25
Apr 19 15:43:19 quorra ollama[1180]: llama_kv_cache_init:        CPU KV buffer size =  1024.00 MiB
Apr 19 15:43:19 quorra ollama[1180]: llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
Apr 19 15:43:19 quorra ollama[1180]: llama_new_context_with_model:        CPU  output buffer size =     0.14 MiB
Apr 19 15:43:19 quorra ollama[1180]: llama_new_context_with_model:        CPU compute buffer size =   164.01 MiB
Apr 19 15:43:19 quorra ollama[1180]: llama_new_context_with_model: graph nodes  = 1030
Apr 19 15:43:19 quorra ollama[1180]: llama_new_context_with_model: graph splits = 1
Apr 19 15:43:20 quorra ollama[3559137]: {"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"140322160809856","timestamp":1713541400}

Without restarting the Ollama service, I swapped the model by using ollama run with a different model, and it loaded onto the GPU. I then exited and ran ollama run with deepseek-coder again, and Ollama loaded it onto the GPU. Just before doing this, the same model had been loaded onto the CPU.

The only software using the GPU is Ollama.
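
Since the workaround is just reloading the model through ollama run, the swap can be scripted and the result verified against what is actually resident on the card (the model tags below are examples; the nvidia-smi query lists every process holding GPU memory):

```
# Reproduce the swap: load a different model, then reload the affected one.
ollama run qwen:7b "hello"
ollama run deepseek-coder:6.7b-instruct "hello"
# Confirm which processes hold GPU memory afterwards.
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv
```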

Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.567Z level=INFO source=routes.go:97 msg="changing loaded model"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.654Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.654Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.656Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama799514660/runners/cuda_v11/libcudart.so.11.0]"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.657Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.657Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.767Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.780Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.780Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.780Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama799514660/runners/cuda_v11/libcudart.so.11.0]"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.780Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.780Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.806Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.819Z level=INFO source=server.go:127 msg="offload to gpu" reallayers=33 layers=33 required="5223.4 MiB" used="5223.4 MiB" available="15857.2 MiB" kv="1024.0 MiB" fulloffload="164.0 MiB" partialoffload="193.0 MiB"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.819Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.819Z level=INFO source=server.go:264 msg="starting llama server" cmd="/tmp/ollama799514660/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-59bb50d8116b6a1f9bfbb940d6bb946a05554e591e30c8c2429ed6c854867ecb --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --port 37141"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.819Z level=INFO source=server.go:389 msg="waiting for llama runner to start responding"
Apr 19 16:39:06 quorra ollama[3928329]: {"function":"server_params_parse","level":"INFO","line":2603,"msg":"logging to file is disabled.","tid":"140412729040896","timestamp":1713544746}
Apr 19 16:39:06 quorra ollama[3928329]: {"build":1,"commit":"7593639","function":"main","level":"INFO","line":2819,"msg":"build info","tid":"140412729040896","timestamp":1713544746}
Apr 19 16:39:06 quorra ollama[3928329]: {"function":"main","level":"INFO","line":2822,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"140412729040896","timestamp":1713544746,"total_threads":32}
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-59bb50d8116b6a1f9bfbb940d6bb946a05554e591e30c8c2429ed6c854867ecb (version GGUF V3 (latest))
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv   1:                               general.name str              = deepseek-ai
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv   4:                          llama.block_count u32              = 32
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 100000.000000
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv  11:                    llama.rope.scaling.type str              = linear
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv  12:                  llama.rope.scaling.factor f32              = 4.000000
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv  13:                          general.file_type u32              = 2
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32256]   = ["!", "\"", "#", "$", "%", "&", "'", ...
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32256]   = [0.000000, 0.000000, 0.000000, 0.0000...
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32256]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,31757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 32013
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 32021
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 32014
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv  25:               general.quantization_version u32              = 2
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - type  f32:   65 tensors
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - type q4_0:  225 tensors
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - type q6_K:    1 tensors
Apr 19 16:39:06 quorra ollama[1180]: llm_load_vocab: mismatch in special tokens definition ( 243/32256 vs 256/32256 ).
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: format           = GGUF V3 (latest)
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: arch             = llama
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: vocab type       = BPE
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_vocab          = 32256
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_merges         = 31757
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_ctx_train      = 16384
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_embd           = 4096
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_head           = 32
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_head_kv        = 32
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_layer          = 32
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_rot            = 128
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_embd_head_k    = 128
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_embd_head_v    = 128
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_gqa            = 1
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_embd_k_gqa     = 4096
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_embd_v_gqa     = 4096
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_ff             = 11008
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_expert         = 0
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_expert_used    = 0
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: causal attn      = 1
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: pooling type     = 0
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: rope type        = 0
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: rope scaling     = linear
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: freq_base_train  = 100000.0
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: freq_scale_train = 0.25
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_yarn_orig_ctx  = 16384
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: rope_finetuned   = unknown
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: ssm_d_conv       = 0
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: ssm_d_inner      = 0
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: ssm_d_state      = 0
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: ssm_dt_rank      = 0
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: model type       = 7B
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: model ftype      = Q4_0
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: model params     = 6.74 B
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: general.name     = deepseek-ai
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: BOS token        = 32013 '<|begin▁of▁sentence|>'
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: EOS token        = 32021 '<|EOT|>'
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: PAD token        = 32014 '<|end▁of▁sentence|>'
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: LF token         = 126 'Ä'
Apr 19 16:39:06 quorra ollama[1180]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
Apr 19 16:39:06 quorra ollama[1180]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
Apr 19 16:39:06 quorra ollama[1180]: ggml_cuda_init: found 1 CUDA devices:
Apr 19 16:39:06 quorra ollama[1180]:   Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
Apr 19 16:39:06 quorra ollama[1180]: llm_load_tensors: ggml ctx size =    0.22 MiB
Apr 19 16:39:06 quorra ollama[1180]: llm_load_tensors: offloading 32 repeating layers to GPU
Apr 19 16:39:06 quorra ollama[1180]: llm_load_tensors: offloading non-repeating layers to GPU
Apr 19 16:39:06 quorra ollama[1180]: llm_load_tensors: offloaded 33/33 layers to GPU
Apr 19 16:39:06 quorra ollama[1180]: llm_load_tensors:        CPU buffer size =    70.88 MiB
Apr 19 16:39:06 quorra ollama[1180]: llm_load_tensors:      CUDA0 buffer size =  3578.38 MiB
Apr 19 16:39:07 quorra ollama[1180]: ..................................................................................................
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model: n_ctx      = 2048
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model: n_batch    = 512
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model: n_ubatch   = 512
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model: freq_base  = 100000.0
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model: freq_scale = 0.25
Apr 19 16:39:07 quorra ollama[1180]: llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.14 MiB
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model:      CUDA0 compute buffer size =   164.00 MiB
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model: graph nodes  = 1030
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model: graph splits = 2
Apr 19 16:39:07 quorra ollama[3928329]: {"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"140412729040896","timestamp":1713544747}
<!-- gh-comment-id:2066855537 --> @MarkWard0110 commented on GitHub (Apr 19, 2024): Another run and the model "deepseek-coder:6.7b-instruct" is running on the CPU when it otherwise runs on GPU. ``` Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.043Z level=INFO source=routes.go:97 msg="changing loaded model" Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.138Z level=INFO source=gpu.go:121 msg="Detecting GPU type" Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.138Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*" Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.139Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama799514660/runners/cuda_v11/libcudart.so.11.0]" Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.147Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart" Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.147Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2" Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.173Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9" Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.184Z level=INFO source=gpu.go:121 msg="Detecting GPU type" Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.184Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*" Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.185Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama799514660/runners/cuda_v11/libcudart.so.11.0]" Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.185Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart" Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.185Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2" Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.209Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9" Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.220Z level=INFO source=server.go:127 msg="offload to gpu" reallayers=0 layers=0 required="5223.4 MiB" used="650.0 MiB" available="270.6 MiB" kv="1024.0 MiB" fulloffload="164.0 MiB" partialoffload="193.0 MiB" Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.220Z level=INFO source=server.go:264 msg="starting llama server" cmd="/tmp/ollama799514660/runners/cpu/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-59bb50d8116b6a1f9bfbb940d6bb946a05554e591e30c8c2429ed6c854867ecb --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 0 --port 34059" Apr 19 15:43:19 quorra ollama[1180]: time=2024-04-19T15:43:19.220Z level=INFO source=server.go:389 msg="waiting for llama runner to start responding" Apr 19 15:43:19 quorra ollama[3559137]: {"function":"server_params_parse","level":"INFO","line":2603,"msg":"logging to file is disabled.","tid":"140322160809856","timestamp":1713541399} Apr 19 15:43:19 quorra ollama[3559137]: {"function":"server_params_parse","level":"WARN","line":2380,"msg":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. 
See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1,"tid":"140322160809856","timestamp":1713541399} Apr 19 15:43:19 quorra ollama[3559137]: {"build":1,"commit":"7593639","function":"main","level":"INFO","line":2819,"msg":"build info","tid":"140322160809856","timestamp":1713541399} Apr 19 15:43:19 quorra ollama[3559137]: {"function":"main","level":"INFO","line":2822,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"140322160809856","timestamp":1713541399,"total_threads":32} Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-59bb50d8116b6a1f9bfbb940d6bb946a05554e591e30c8c2429ed6c854867ecb (version GGUF V3 (latest)) Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 0: general.architecture str = llama Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 1: general.name str = deepseek-ai Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 2: llama.context_length u32 = 16384 Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 4: llama.block_count u32 = 32 Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008 Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32 Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001 Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 10: llama.rope.freq_base f32 = 100000.000000 Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 11: llama.rope.scaling.type str = linear Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 12: llama.rope.scaling.factor f32 = 4.000000 Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 13: general.file_type u32 = 2 Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2 Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32256] = ["!", "\"", "#", "$", "%", "&", "'", ... Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32256] = [0.000000, 0.000000, 0.000000, 0.0000... Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,31757] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e... 
Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 32013 Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 32021 Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 32014 Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 24: tokenizer.chat_template str = {% if not add_generation_prompt is de... Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - kv 25: general.quantization_version u32 = 2 Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - type f32: 65 tensors Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - type q4_0: 225 tensors Apr 19 15:43:19 quorra ollama[1180]: llama_model_loader: - type q6_K: 1 tensors Apr 19 15:43:19 quorra ollama[1180]: llm_load_vocab: mismatch in special tokens definition ( 243/32256 vs 256/32256 ). Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: format = GGUF V3 (latest) Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: arch = llama Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: vocab type = BPE Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_vocab = 32256 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_merges = 31757 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_ctx_train = 16384 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_embd = 4096 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_head = 32 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_head_kv = 32 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_layer = 32 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_rot = 128 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_embd_head_k = 128 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_embd_head_v = 128 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_gqa = 1 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_embd_k_gqa = 4096 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_embd_v_gqa = 4096 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: f_norm_eps = 0.0e+00 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: f_clamp_kqv = 0.0e+00 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: f_logit_scale = 0.0e+00 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_ff = 11008 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_expert = 0 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_expert_used = 0 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: causal attn = 1 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: pooling type = 0 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: rope type = 0 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: rope scaling = linear Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: freq_base_train = 100000.0 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: freq_scale_train = 0.25 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: n_yarn_orig_ctx = 16384 Apr 19 15:43:19 quorra ollama[1180]: 
llm_load_print_meta: rope_finetuned = unknown Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: ssm_d_conv = 0 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: ssm_d_inner = 0 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: ssm_d_state = 0 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: ssm_dt_rank = 0 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: model type = 7B Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: model ftype = Q4_0 Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: model params = 6.74 B Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: model size = 3.56 GiB (4.54 BPW) Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: general.name = deepseek-ai Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: BOS token = 32013 '<|begin▁of▁sentence|>' Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: EOS token = 32021 '<|EOT|>' Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: PAD token = 32014 '<|end▁of▁sentence|>' Apr 19 15:43:19 quorra ollama[1180]: llm_load_print_meta: LF token = 126 'Ä' Apr 19 15:43:19 quorra ollama[1180]: llm_load_tensors: ggml ctx size = 0.11 MiB Apr 19 15:43:19 quorra ollama[1180]: llm_load_tensors: CPU buffer size = 3649.25 MiB Apr 19 15:43:19 quorra ollama[1180]: .................................................................................................. Apr 19 15:43:19 quorra ollama[1180]: llama_new_context_with_model: n_ctx = 2048 Apr 19 15:43:19 quorra ollama[1180]: llama_new_context_with_model: n_batch = 512 Apr 19 15:43:19 quorra ollama[1180]: llama_new_context_with_model: n_ubatch = 512 Apr 19 15:43:19 quorra ollama[1180]: llama_new_context_with_model: freq_base = 100000.0 Apr 19 15:43:19 quorra ollama[1180]: llama_new_context_with_model: freq_scale = 0.25 Apr 19 15:43:19 quorra ollama[1180]: llama_kv_cache_init: CPU KV buffer size = 1024.00 MiB Apr 19 15:43:19 quorra ollama[1180]: llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB Apr 19 15:43:19 quorra ollama[1180]: llama_new_context_with_model: CPU output buffer size = 0.14 MiB Apr 19 15:43:19 quorra ollama[1180]: llama_new_context_with_model: CPU compute buffer size = 164.01 MiB Apr 19 15:43:19 quorra ollama[1180]: llama_new_context_with_model: graph nodes = 1030 Apr 19 15:43:19 quorra ollama[1180]: llama_new_context_with_model: graph splits = 1 Apr 19 15:43:20 quorra ollama[3559137]: {"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"140322160809856","timestamp":1713541400} ``` Without restarting the Ollama service. I swap the model by using ollama run and provide a different model to load. It loaded into the GPU. I then exit and run ollama run with the deep-seeker again and Ollama will load it into the GPU. Just before doing this the model was loaded onto the CPU. The only software using the GPU is Ollama. 
```
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.567Z level=INFO source=routes.go:97 msg="changing loaded model"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.654Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.654Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.656Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama799514660/runners/cuda_v11/libcudart.so.11.0]"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.657Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.657Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.767Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.780Z level=INFO source=gpu.go:121 msg="Detecting GPU type"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.780Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libcudart.so*"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.780Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: [/tmp/ollama799514660/runners/cuda_v11/libcudart.so.11.0]"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.780Z level=INFO source=gpu.go:126 msg="Nvidia GPU detected via cudart"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.780Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.806Z level=INFO source=gpu.go:202 msg="[cudart] CUDART CUDA Compute Capability detected: 8.9"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.819Z level=INFO source=server.go:127 msg="offload to gpu" reallayers=33 layers=33 required="5223.4 MiB" used="5223.4 MiB" available="15857.2 MiB" kv="1024.0 MiB" fulloffload="164.0 MiB" partialoffload="193.0 MiB"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.819Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.819Z level=INFO source=server.go:264 msg="starting llama server" cmd="/tmp/ollama799514660/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-59bb50d8116b6a1f9bfbb940d6bb946a05554e591e30c8c2429ed6c854867ecb --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --port 37141"
Apr 19 16:39:06 quorra ollama[1180]: time=2024-04-19T16:39:06.819Z level=INFO source=server.go:389 msg="waiting for llama runner to start responding"
Apr 19 16:39:06 quorra ollama[3928329]: {"function":"server_params_parse","level":"INFO","line":2603,"msg":"logging to file is disabled.","tid":"140412729040896","timestamp":1713544746}
Apr 19 16:39:06 quorra ollama[3928329]: {"build":1,"commit":"7593639","function":"main","level":"INFO","line":2819,"msg":"build info","tid":"140412729040896","timestamp":1713544746}
Apr 19 16:39:06 quorra ollama[3928329]: {"function":"main","level":"INFO","line":2822,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"140412729040896","timestamp":1713544746,"total_threads":32}
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-59bb50d8116b6a1f9bfbb940d6bb946a05554e591e30c8c2429ed6c854867ecb (version GGUF V3 (latest))
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 0: general.architecture str = llama
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 1: general.name str = deepseek-ai
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 2: llama.context_length u32 = 16384
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 4: llama.block_count u32 = 32
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 10: llama.rope.freq_base f32 = 100000.000000
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 11: llama.rope.scaling.type str = linear
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 12: llama.rope.scaling.factor f32 = 4.000000
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 13: general.file_type u32 = 2
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32256] = ["!", "\"", "#", "$", "%", "&", "'", ...
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32256] = [0.000000, 0.000000, 0.000000, 0.0000...
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,31757] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 32013
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 32021
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 32014
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 24: tokenizer.chat_template str = {% if not add_generation_prompt is de...
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - kv 25: general.quantization_version u32 = 2
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - type f32: 65 tensors
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - type q4_0: 225 tensors
Apr 19 16:39:06 quorra ollama[1180]: llama_model_loader: - type q6_K: 1 tensors
Apr 19 16:39:06 quorra ollama[1180]: llm_load_vocab: mismatch in special tokens definition ( 243/32256 vs 256/32256 ).
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: format = GGUF V3 (latest)
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: arch = llama
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: vocab type = BPE
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_vocab = 32256
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_merges = 31757
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_ctx_train = 16384
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_embd = 4096
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_head = 32
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_head_kv = 32
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_layer = 32
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_rot = 128
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_embd_head_k = 128
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_embd_head_v = 128
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_gqa = 1
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_embd_k_gqa = 4096
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_embd_v_gqa = 4096
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: f_norm_eps = 0.0e+00
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: f_logit_scale = 0.0e+00
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_ff = 11008
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_expert = 0
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_expert_used = 0
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: causal attn = 1
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: pooling type = 0
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: rope type = 0
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: rope scaling = linear
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: freq_base_train = 100000.0
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: freq_scale_train = 0.25
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: n_yarn_orig_ctx = 16384
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: rope_finetuned = unknown
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: ssm_d_conv = 0
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: ssm_d_inner = 0
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: ssm_d_state = 0
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: ssm_dt_rank = 0
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: model type = 7B
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: model ftype = Q4_0
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: model params = 6.74 B
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: model size = 3.56 GiB (4.54 BPW)
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: general.name = deepseek-ai
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: BOS token = 32013 '<|begin▁of▁sentence|>'
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: EOS token = 32021 '<|EOT|>'
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: PAD token = 32014 '<|end▁of▁sentence|>'
Apr 19 16:39:06 quorra ollama[1180]: llm_load_print_meta: LF token = 126 'Ä'
Apr 19 16:39:06 quorra ollama[1180]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
Apr 19 16:39:06 quorra ollama[1180]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
Apr 19 16:39:06 quorra ollama[1180]: ggml_cuda_init: found 1 CUDA devices:
Apr 19 16:39:06 quorra ollama[1180]: Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
Apr 19 16:39:06 quorra ollama[1180]: llm_load_tensors: ggml ctx size = 0.22 MiB
Apr 19 16:39:06 quorra ollama[1180]: llm_load_tensors: offloading 32 repeating layers to GPU
Apr 19 16:39:06 quorra ollama[1180]: llm_load_tensors: offloading non-repeating layers to GPU
Apr 19 16:39:06 quorra ollama[1180]: llm_load_tensors: offloaded 33/33 layers to GPU
Apr 19 16:39:06 quorra ollama[1180]: llm_load_tensors: CPU buffer size = 70.88 MiB
Apr 19 16:39:06 quorra ollama[1180]: llm_load_tensors: CUDA0 buffer size = 3578.38 MiB
Apr 19 16:39:07 quorra ollama[1180]: ..................................................................................................
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model: n_ctx = 2048
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model: n_batch = 512
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model: n_ubatch = 512
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model: freq_base = 100000.0
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model: freq_scale = 0.25
Apr 19 16:39:07 quorra ollama[1180]: llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model: CUDA_Host output buffer size = 0.14 MiB
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model: CUDA0 compute buffer size = 164.00 MiB
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model: graph nodes = 1030
Apr 19 16:39:07 quorra ollama[1180]: llama_new_context_with_model: graph splits = 2
Apr 19 16:39:07 quorra ollama[3928329]: {"function":"initialize","level":"INFO","line":448,"msg":"initializing slots","n_slots":1,"tid":"140412729040896","timestamp":1713544747}
```

@dhiltgen commented on GitHub (Apr 19, 2024):

@MarkWard0110 can you run the server with OLLAMA_DEBUG=1 and repro the CPU scenario and share the log?


@MarkWard0110 commented on GitHub (Apr 20, 2024):

@dhiltgen
I have attached a zip of my log file.
[ollama_log.zip](https://github.com/ollama/ollama/files/15046721/ollama_log.zip)


@dhiltgen commented on GitHub (Apr 22, 2024):

From what I can see in the logs, it looks like it loads some models into the GPU, but then the GPU VRAM gets filled up, and it has to fall back to CPU.

```
...
Apr 20 00:24:19 quorra ollama[571888]: llm_load_tensors: offloaded 29/29 layers to GPU
...
Apr 20 00:27:29 quorra ollama[571888]: llm_load_tensors: offloaded 30/81 layers to GPU
...
Apr 20 02:40:19 quorra ollama[571888]: [0] CUDA totalMem 16852516864
Apr 20 02:40:19 quorra ollama[571888]: [0] CUDA freeMem 252313600
...
Apr 20 02:40:19 quorra ollama[571888]: time=2024-04-20T02:40:19.173Z level=INFO source=server.go:127 msg="offload to gpu" reallayers=0 layers=0 required="19193.1 MiB" used="805.0 MiB" available="240.6 MiB" kv="384.0 MiB" fulloffload="324.0 MiB" partialoffload="348.0 MiB"
```

(that last line saying "layers=0" means it can't offload anything to the GPU since it's too full.)

Can you run `nvidia-smi` in another window while you're running your scenarios to see what is using up GPU VRAM? Assuming Ollama is the only heavy VRAM user, there may be a crash or orphan scenario here where we're not properly cleaning up after ourselves when switching models, and you'd see that as lots of `ollama_llama_server` processes running on the GPU in `nvidia-smi`.

It may be helpful to shut down the system service and run a server with debug enabled if you can isolate which specific scenario causes this problem, something like: `OLLAMA_DEBUG=1 ollama serve 2>&1 | tee server.log`
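If it helps, here is a tiny Go sketch of that monitoring idea: poll `nvidia-smi` and print only the lines that mention the runner process, so orphaned runners stand out. The poll interval is arbitrary and this is just a convenience script, not anything Ollama ships.

```go
// watchgpu polls nvidia-smi and prints any output lines that mention the
// Ollama runner process, making orphaned runners easy to spot over time.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

func main() {
	for {
		out, err := exec.Command("nvidia-smi").Output()
		if err != nil {
			fmt.Println("nvidia-smi failed:", err)
			return
		}
		for _, line := range strings.Split(string(out), "\n") {
			if strings.Contains(line, "ollama_llama_server") {
				fmt.Println(line)
			}
		}
		time.Sleep(2 * time.Second)
	}
}
```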

I've got another PR that should merge soon which revamps a bunch of this logic to support concurrency.


@MarkWard0110 commented on GitHub (Apr 23, 2024):

When I run `OLLAMA_HOST=0.0.0.0 OLLAMA_DEBUG=1 ollama serve 2>&1 | tee server.log` from the command line, it does not see the models that were downloaded when it runs as a service.


@MarkWard0110 commented on GitHub (Apr 23, 2024):

As I am testing on my server, no other process would be using the GPU; it is only Ollama. I will observe the GPU processes and see if there is more than one instance.


@MarkWard0110 commented on GitHub (Apr 23, 2024):

During a test I noticed that the GPU process stopped, and there were no Ollama processes on the GPU when it started a new model and loaded it entirely on the CPU.

[server.log](https://github.com/ollama/ollama/files/15070605/server.log)


@MarkWard0110 commented on GitHub (Apr 23, 2024):

@dhiltgen, I could try out your PR. Is it pretty easy to get the source up and running?


@dhiltgen commented on GitHub (Apr 24, 2024):

@MarkWard0110 the PR is now merged to main and will be in the next release. #3418


@MarkWard0110 commented on GitHub (Apr 24, 2024):

@dhiltgen , I have tested against main and I get the same issue where the model will sometimes load on CPU instead of GPU.

My log may include some of my initial testing where I had it built without GPU support; I later figured out how to get the GPU build working.
[ollama_source_log.zip](https://github.com/ollama/ollama/files/15087512/ollama_source_log.zip)

Here is something weird I heard today. Something might be wrong with Intel Core i9 and the GPU that causes software to "see" no VRAM available. I don't know the specifics. Strangely, I didn't have this problem with previous versions of Ollama.

I can roll back to an older version for double confirmation. My previous benchmarks do not indicate this ever happened.


@MarkWard0110 commented on GitHub (Apr 24, 2024):

@dhiltgen, I have checked out and built `tags/v0.1.31`, and my initial testing has not been able to reproduce the bug. I will continue to run the test on that build.


@MarkWard0110 commented on GitHub (Apr 24, 2024):

What if I tried to isolate whether the change is in Ollama or in llama.cpp by testing v0.1.31 with the llama.cpp used in the newer release and on main? What would be a good approach for doing this?


@MarkWard0110 commented on GitHub (Apr 24, 2024):

I have a build of Ollama v0.1.31 running with the llama.cpp commit from v0.1.32, and I am testing it right now.


@MarkWard0110 commented on GitHub (Apr 24, 2024):

More details about the program I am using.

The program benchmarks a fixed list of prompts against the list of models loaded in Ollama. It sorts the list of models randomly and loops through it one by one. It uses the generate API. The program will wait up to 5 minutes for the generated response.

The timeout is mostly in place for when models begin to generate and never stop.

I might try a test in which the client does not time out and awaits every response. I'll have to filter out the models that never stop.
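For context, the benchmark loop is roughly equivalent to the sketch below. The model names, the prompt, and the localhost endpoint are placeholders; the real program shuffles a larger model list and iterates over a fixed set of prompts.

```go
// Minimal sketch of the benchmark loop: shuffle the model list, then call
// the Ollama generate API for each model with a 5-minute timeout.
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

func main() {
	models := []string{"phi3:3.8b", "codellama:34b"} // placeholder list
	rand.Shuffle(len(models), func(i, j int) { models[i], models[j] = models[j], models[i] })

	for _, m := range models {
		body, _ := json.Marshal(map[string]any{
			"model":  m,
			"prompt": "Write a haiku about GPUs.", // placeholder prompt
			"stream": false,
		})
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
		req, _ := http.NewRequestWithContext(ctx, "POST",
			"http://localhost:11434/api/generate", bytes.NewReader(body))
		req.Header.Set("Content-Type", "application/json")
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			fmt.Println(m, "timed out or failed:", err)
		} else {
			resp.Body.Close()
			fmt.Println(m, "status:", resp.Status)
		}
		cancel()
	}
}
```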


@MarkWard0110 commented on GitHub (Apr 25, 2024):

I can run it under a debugger with breakpoints. I will attempt to observe the execution during my testing.


@MarkWard0110 commented on GitHub (Apr 25, 2024):

@dhiltgen, I can reproduce the issue with `ollama run` on the `main` branch.
I have a breakpoint on the `slog.Debug("insufficient VRAM to load any model layers")` line in [`memory.go`](https://github.com/ollama/ollama/blob/3450a57d4a4a2c74892318b7f3986bab6461b2fc/llm/memory.go#L99).

Start a new ollama instance, and when I perform the following, it will hit the breakpoint.

  1. ollama run phi3:3.8b
  2. /exit
  3. ollama run codellama:34b
  4. /exit
  5. ollama run phi3:3.8b

On the 5th step, it hits the breakpoint.
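The same swap sequence can also be scripted; this is only a rough sketch that drives the `ollama` CLI with a short prompt for each model in the order given in the steps above.

```go
// Sketch: reproduce the model-swap sequence by invoking the ollama CLI
// with a one-line prompt for each model in turn, so each run exits on its own.
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	sequence := []string{"phi3:3.8b", "codellama:34b", "phi3:3.8b"}
	for _, model := range sequence {
		fmt.Println("running", model)
		out, err := exec.Command("ollama", "run", model, "Say hi in one word.").CombinedOutput()
		if err != nil {
			fmt.Println("error:", err)
		}
		fmt.Println(string(out))
	}
}
```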


@MarkWard0110 commented on GitHub (Apr 25, 2024):

@dhiltgen ,
I see there was a change in `gpu.go` from v0.1.31 to v0.1.32, and I have been watching the results of `GetGPUInfo()`.

The OS is Ubuntu Server and I am running Ollama under a VSCode devcontainer with GPU support. This is only for debugging. I get the issue on the native host and in the devcontainer.

Start `ollama serve`; `GetGPUInfo()` returns:
![image](https://github.com/ollama/ollama/assets/90335263/06ef8061-2468-4691-afc3-00d1385f4cc1)

In another terminal, `ollama run phi3:3.8b`; `GetGPUInfo()` returns:
![image](https://github.com/ollama/ollama/assets/90335263/5f9b0766-ecad-41a2-ac4a-2ec0cd14987d)
`/exit` the chat.

`ollama run codellama:34b`; `GetGPUInfo()` returns:
![image](https://github.com/ollama/ollama/assets/90335263/e4c441fd-8e78-4065-8549-de73f799ca64)
`/exit` the chat.

`ollama run phi3:3.8b`; `GetGPUInfo()` returns:
![image](https://github.com/ollama/ollama/assets/90335263/e6a775c8-a6bb-4790-a303-e5d44f1b8019)

This causes the system to decide that not enough VRAM is available to offload anything to the GPU.

Ollama is the only process using the GPU. The NVIDIA GPU is not used for video output; I am using the CPU's integrated GPU for video output.


@MarkWard0110 commented on GitHub (Apr 25, 2024):

It appears Ollama is getting the wrong available-memory reading each time. Maybe the model has not fully unloaded before the GPU info call is made?

I tested a situation where, before calling the last `ollama run phi3:3.8b`, I waited for Ollama to automatically unload `codellama:34b`. Once I saw Ollama unload the model and the GPU report the VRAM as available, I ran the last `ollama run phi3:3.8b` and it loaded successfully.

I'm trying to find where in the code a model is unloaded before the next one is loaded, to see if there is a situation where the model swap overlaps and the new model reads available VRAM that still reflects the previous model's usage.
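To make the race I suspect concrete: the old runner is stopped, and free VRAM is queried before the driver has actually released the memory. A hypothetical guard (the function, its parameters, and the package name are my own invention, not existing Ollama code) would wait for the reading to recover before sizing the offload:

```go
package sched // hypothetical package name

import "time"

// waitForVRAMRelease polls free VRAM until it reaches the amount we expect
// the previous runner to give back, or until the deadline passes. It returns
// the last reading so the caller can size the GPU offload from a stable value.
func waitForVRAMRelease(getFreeVRAM func() uint64, expected uint64, deadline time.Duration) uint64 {
	start := time.Now()
	free := getFreeVRAM()
	for free < expected && time.Since(start) < deadline {
		time.Sleep(100 * time.Millisecond)
		free = getFreeVRAM()
	}
	return free
}
```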


@MarkWard0110 commented on GitHub (Apr 25, 2024):

If I put breakpoints in the areas of the code where the runner is unloaded and right at the beginning of `initGPUHandles()`, I can't reproduce the issue. I think this is because I have slowed the program down enough for the VRAM state to change before it calls `initGPUHandles()`.

It might be hard to reproduce this issue under a debugger if it is a timing issue. It might be a bug that is not seen on slower processors.


@MarkWard0110 commented on GitHub (Apr 25, 2024):

Having the breakpoint at the beginning of `func initGPUHandles() *handles {` gives a delay between unloading and loading. With that delay, every call to `initGPUHandles()` sees the full amount of GPU memory available.


@MarkWard0110 commented on GitHub (Apr 25, 2024):

@dhiltgen ,

I made the following change to `Close()` in `server.go` so that it waits for the process to exit. With this change, I could not reproduce the issue.

```
func (s *llmServer) Close() error {
	if s.cmd != nil {
		slog.Debug("stopping llama server")
		if err := s.cmd.Process.Kill(); err != nil {
			return err
		}
		return s.cmd.Wait()
	}

	return nil
}
```
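The intent of adding `cmd.Wait()` is that `Close()` should not return until the runner process has actually been reaped, so the VRAM it held should be back with the driver before the next load measures free GPU memory.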

@MarkWard0110 commented on GitHub (Apr 26, 2024):

I hit the issue again. This may not be the fix.


@MarkWard0110 commented on GitHub (Apr 26, 2024):

I got the same error when the Ollama server logged `runner expired event received`.


@MarkWard0110 commented on GitHub (Apr 27, 2024):

I think the `timer expired, expiring to unload` path is an interesting situation when I am running generations that land right at the 5-minute mark. The default session duration is 5 minutes; I understand the client could provide a keep-alive value. I think the same race condition that was happening on reload is happening here as well.

I see that loading and unloading appear to be handled in different goroutines. Perhaps there needs to be a bit more synchronization between loading and unloading, especially when the number of runners is 1, at the maximum limit, or a pending request is replacing an existing runner, so that the pending load does not get stale memory readings.
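As a rough illustration of what I mean by more synchronization (purely a sketch with made-up names, not the actual scheduler code):

```go
package sched // hypothetical package name

import "sync"

// Sketch: serialize unload and load so a pending load can never measure
// free VRAM while a previous runner is still being torn down.
type runner struct{}

func (r *runner) close() {} // blocks until the old runner has fully exited

type scheduler struct {
	mu     sync.Mutex
	runner *runner
}

func (s *scheduler) freeVRAM() uint64 { return 0 } // placeholder probe

func (s *scheduler) swap(load func(freeVRAM uint64) *runner) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.runner != nil {
		s.runner.close() // tear down the old runner first
	}
	s.runner = load(s.freeVRAM()) // VRAM is only measured after teardown
}
```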


@MarkWard0110 commented on GitHub (Apr 27, 2024):

I'm thinking the second problem happens when a new pending request arrives just as `processCompleted` is handling a `finishedReqCh` event.


@MarkWard0110 commented on GitHub (Apr 27, 2024):

I think I have fixed the second problem by clearing `runner.expireTime` when a pending request is assigned an existing runner. This avoids "renewing" the timer, which could fire if a model generation happened to run long enough to trigger it.

I am running it through my testing. What I have to figure out now is how to build unit tests that confirm the issues and validate the changes.
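In code, the change is along these lines (the struct and field names here are a sketch from memory and may not match the scheduler exactly):

```go
package sched // hypothetical package name

import "time"

type runnerRef struct {
	expireTime  time.Time
	expireTimer *time.Timer
}

// useRunner sketches reusing an already-loaded runner for a pending request:
// clearing expireTime (and stopping the timer) prevents a stale "expired"
// event from unloading the model in the middle of a long generation.
func useRunner(r *runnerRef) {
	r.expireTime = time.Time{} // zero value marks the runner as in use
	if r.expireTimer != nil {
		r.expireTimer.Stop()
	}
}
```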


@MarkWard0110 commented on GitHub (Apr 28, 2024):

@dhiltgen, I have forked the repository, and the following branch holds the changes I have made that are currently working for me:

https://github.com/MarkWard0110/fork.ollama/tree/fix/issue-3736


@dhiltgen commented on GitHub (Apr 28, 2024):

@MarkWard0110 thanks for digging into this one! I'd say go ahead and submit a PR once you think you've fixed it or are close, and let's continue the conversation on the PR.

We just shipped the first RC for 0.1.33, but it sounds like this is close to fixed, so we may be able to get it in for the final 0.1.33.


@MarkWard0110 commented on GitHub (Apr 29, 2024):

I have created the pull request https://github.com/ollama/ollama/pull/4031


@dhiltgen commented on GitHub (May 1, 2024):

#4031 is merged and will be in the final 0.1.33 release.

Reference: github-starred/ollama#48813