[GH-ISSUE #4799] ollama (commit d4a8610) run deepseek-v2:16b Error: llama runner process has terminated: signal: aborted (core dumped) #65065

Closed
opened 2026-05-03 19:40:44 -05:00 by GiteaMirror · 22 comments

Originally created by @zhqfdn on GitHub (Jun 3, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4799

What is the issue?

Jun 04 00:46:12 localhost.localdomain ollama[114642]: llama_model_loader: - type f32: 108 tensors
Jun 04 00:46:12 localhost.localdomain ollama[114642]: llama_model_loader: - type q4_0: 268 tensors
Jun 04 00:46:12 localhost.localdomain ollama[114642]: llama_model_loader: - type q6_K: 1 tensors
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_vocab: special tokens cache size = 2400
Jun 04 00:46:13 localhost.localdomain ollama[114642]: time=2024-06-04T00:46:13.129+08:00 level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server loading model"
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_vocab: token to piece cache size = 1.3318 MB
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: format = GGUF V3 (latest)
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: arch = deepseek2
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: vocab type = BPE
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_vocab = 102400
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_merges = 99757
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_ctx_train = 163840
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_embd = 2048
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_head = 16
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_head_kv = 16
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_layer = 27
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_rot = 64
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_embd_head_k = 192
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_embd_head_v = 128
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_gqa = 1
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_embd_k_gqa = 3072
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_embd_v_gqa = 2048
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: f_norm_eps = 0.0e+00
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: f_logit_scale = 0.0e+00
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_ff = 10944
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_expert = 64
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_expert_used = 6
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: causal attn = 1
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: pooling type = 0
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: rope type = 0
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: rope scaling = yarn
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: freq_base_train = 10000.0
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: freq_scale_train = 0.025
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_yarn_orig_ctx = 4096
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: rope_finetuned = unknown
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: ssm_d_conv = 0
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: ssm_d_inner = 0
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: ssm_d_state = 0
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: ssm_dt_rank = 0
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: model type = 16B
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: model ftype = Q4_0
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: model params = 15.71 B
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: model size = 8.29 GiB (4.53 BPW)
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: general.name = DeepSeek-V2-Lite-Chat
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: BOS token = 100000 '<|begin▁of▁sentence|>'
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: EOS token = 100001 '<|end▁of▁sentence|>'
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: PAD token = 100001 '<|end▁of▁sentence|>'
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: LF token = 126 'Ä'
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_layer_dense_lead = 1
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_lora_q = 0
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_lora_kv = 512
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_ff_exp = 1408
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: n_expert_shared = 2
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: expert_weights_scale = 1.0
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_print_meta: rope_yarn_log_mul = 0.0707
Jun 04 00:46:13 localhost.localdomain ollama[114642]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
Jun 04 00:46:13 localhost.localdomain ollama[114642]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
Jun 04 00:46:13 localhost.localdomain ollama[114642]: ggml_cuda_init: found 1 CUDA devices:
Jun 04 00:46:13 localhost.localdomain ollama[114642]: Device 0: Tesla T4, compute capability 7.5, VMM: yes
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_tensors: ggml ctx size = 0.35 MiB
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_tensors: offloading 27 repeating layers to GPU
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_tensors: offloading non-repeating layers to GPU
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_tensors: offloaded 28/28 layers to GPU
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_tensors: CPU buffer size = 112.50 MiB
Jun 04 00:46:13 localhost.localdomain ollama[114642]: llm_load_tensors: CUDA0 buffer size = 8376.27 MiB
Jun 04 00:46:15 localhost.localdomain ollama[114642]: llama_new_context_with_model: n_ctx = 20480
Jun 04 00:46:15 localhost.localdomain ollama[114642]: llama_new_context_with_model: n_batch = 512
Jun 04 00:46:15 localhost.localdomain ollama[114642]: llama_new_context_with_model: n_ubatch = 512
Jun 04 00:46:15 localhost.localdomain ollama[114642]: llama_new_context_with_model: flash_attn = 1
Jun 04 00:46:15 localhost.localdomain ollama[114642]: llama_new_context_with_model: freq_base = 10000.0
Jun 04 00:46:15 localhost.localdomain ollama[114642]: llama_new_context_with_model: freq_scale = 0.025
Jun 04 00:46:15 localhost.localdomain ollama[114642]: llama_kv_cache_init: CUDA0 KV buffer size = 5400.00 MiB
Jun 04 00:46:15 localhost.localdomain ollama[114642]: llama_new_context_with_model: KV self size = 5400.00 MiB, K (f16): 3240.00 MiB, V (f16): 2160.00 MiB
Jun 04 00:46:15 localhost.localdomain ollama[114642]: llama_new_context_with_model: CUDA_Host output buffer size = 3.98 MiB
Jun 04 00:46:15 localhost.localdomain ollama[114642]: GGML_ASSERT: /home/tools/ollama/llm/llama.cpp/ggml.c:5714: ggml_nelements(a) == ne0*ne1
Jun 04 00:46:15 localhost.localdomain systemd-coredump[114889]: [🡕] Process 114887 (ollama_llama_se) of user 996 dumped core.
Jun 04 00:46:15 localhost.localdomain systemd-coredump[114896]: [🡕] Process 114812 (ollama_llama_se) of user 996 dumped core.
Jun 04 00:46:15 localhost.localdomain ollama[114642]: time=2024-06-04T00:46:15.954+08:00 level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error"
Jun 04 00:46:16 localhost.localdomain ollama[114642]: time=2024-06-04T00:46:16.205+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) "
Jun 04 00:46:16 localhost.localdomain ollama[114642]: [GIN] 2024/06/04 - 00:46:16 | 500 | 5.505031311s | 10.10.11.11 | POST "/api/chat"
Jun 04 00:46:21 localhost.localdomain ollama[114642]: time=2024-06-04T00:46:21.896+08:00 level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.6909691559999995
Jun 04 00:46:23 localhost.localdomain ollama[114642]: time=2024-06-04T00:46:23.229+08:00 level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=7.023871504
Jun 04 00:46:24 localhost.localdomain ollama[114642]: time=2024-06-04T00:46:24.409+08:00 level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=8.204406674


localhost.localdomain Tue Jun 4 00:48:30 2024 550.54.15
[0] Tesla T4 | 43°C, 0 % | 2642 / 15360 MB | ollama/114698(2640M)
[1] Tesla T4 | 34°C, 0 % | 2 / 15360 MB |
[2] Tesla T4 | 32°C, 0 % | 2 / 15360 MB |
[3] Tesla T4 | 33°C, 0 % | 2 / 15360 MB |
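For context on the crash itself: the `GGML_ASSERT` at `ggml.c:5714` checks `ggml_nelements(a) == ne0*ne1`, i.e. a tensor being viewed or reshaped as 2-D must have exactly `ne0 * ne1` elements; when that precondition fails, ggml aborts, which matches the "signal: aborted (core dumped)" above. A minimal Python sketch of the invariant (the function name and signature here are illustrative, not ggml's actual API):

```python
# Sketch of the precondition behind GGML_ASSERT: ggml_nelements(a) == ne0*ne1.
# A 2-D reshape is only valid when the flat element count matches the
# requested ne0 x ne1 shape; ggml aborts the process otherwise.

def reshape_2d(elements, ne0, ne1):
    """Reshape a flat list into ne1 rows of ne0 elements each,
    mimicking the element-count check that fires in ggml.c:5714."""
    assert len(elements) == ne0 * ne1, "ggml_nelements(a) == ne0*ne1 failed"
    return [elements[i * ne0:(i + 1) * ne0] for i in range(ne1)]

reshape_2d(list(range(6)), 2, 3)    # OK: 6 == 2*3
# reshape_2d(list(range(6)), 2, 4)  # fails: 6 != 8 -> ggml would abort here
```

In other words, the runner is computing a shape for some deepseek2 tensor that disagrees with its stored element count, which points at an architecture-support bug in the bundled llama.cpp rather than an out-of-memory condition.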

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

d4a8610

GiteaMirror added the bug label 2026-05-03 19:40:44 -05:00

@lyatcn commented on GitHub (Jun 4, 2024):

Note: this requires [Ollama 0.1.40](https://github.com/ollama/ollama/releases/tag/v0.1.40), which is in pre-release.


@rick-github commented on GitHub (Jun 4, 2024):

Seeing the same problem with 0.1.40:

$ ollama --version
ollama version is 0.1.40
$ ollama run deepseek-v2:16b-lite-chat-q4_K_S
Error: llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/root/.ollama/models/blobs/sha256-1625f490246286ce0f4bd3fce83436c5209832a2730d23ea4330308003613f81' 
$ docker compose logs ollama
ollama-1  | 2024/06/04 21:24:18 routes.go:1007: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
ollama-1  | time=2024-06-04T21:24:18.764Z level=INFO source=images.go:729 msg="total blobs: 1131"
ollama-1  | time=2024-06-04T21:24:19.042Z level=INFO source=images.go:736 msg="total unused blobs removed: 0"
ollama-1  | time=2024-06-04T21:24:19.094Z level=INFO source=routes.go:1053 msg="Listening on [::]:11434 (version 0.1.40)"
ollama-1  | time=2024-06-04T21:24:19.095Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1744626780/runners
ollama-1  | time=2024-06-04T21:24:21.386Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cuda_v11 rocm_v60002 cpu cpu_avx cpu_avx2]"
ollama-1  | time=2024-06-04T21:24:21.465Z level=INFO source=types.go:71 msg="inference compute" id=GPU-b5d7e56c-4491-8eeb-cb2d-e8d8424e5bb7 library=cuda compute=8.9 driver=12.0 name="NVIDIA GeForce RTX 4070" total="11.7 GiB" available="9.3 GiB"
ollama-1  | [GIN] 2024/06/04 - 21:25:14 | 200 |      14.984µs |       127.0.0.1 | HEAD     "/"
ollama-1  | [GIN] 2024/06/04 - 21:25:14 | 200 |    1.561912ms |       127.0.0.1 | POST     "/api/show"
ollama-1  | [GIN] 2024/06/04 - 21:25:14 | 200 |     1.19434ms |       127.0.0.1 | POST     "/api/show"
ollama-1  | time=2024-06-04T21:25:14.690Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=26 memory.available="9.3 GiB" memory.required.full="9.8 GiB" memory.required.partial="9.2 GiB" memory.required.kv="432.0 MiB" memory.weights.total="8.8 GiB" memory.weights.repeating="8.6 GiB" memory.weights.nonrepeating="164.1 MiB" memory.graph.full="72.0 MiB" memory.graph.partial="72.0 MiB"
ollama-1  | time=2024-06-04T21:25:14.690Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=26 memory.available="9.3 GiB" memory.required.full="9.8 GiB" memory.required.partial="9.2 GiB" memory.required.kv="432.0 MiB" memory.weights.total="8.8 GiB" memory.weights.repeating="8.6 GiB" memory.weights.nonrepeating="164.1 MiB" memory.graph.full="72.0 MiB" memory.graph.partial="72.0 MiB"
ollama-1  | time=2024-06-04T21:25:14.691Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=26 memory.available="9.3 GiB" memory.required.full="9.8 GiB" memory.required.partial="9.2 GiB" memory.required.kv="432.0 MiB" memory.weights.total="8.8 GiB" memory.weights.repeating="8.6 GiB" memory.weights.nonrepeating="164.1 MiB" memory.graph.full="72.0 MiB" memory.graph.partial="72.0 MiB"
ollama-1  | time=2024-06-04T21:25:14.691Z level=INFO source=server.go:341 msg="starting llama server" cmd="/tmp/ollama1744626780/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-1625f490246286ce0f4bd3fce83436c5209832a2730d23ea4330308003613f81 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 26 --parallel 1 --port 44605"
ollama-1  | time=2024-06-04T21:25:14.691Z level=INFO source=sched.go:338 msg="loaded runners" count=1
ollama-1  | time=2024-06-04T21:25:14.691Z level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
ollama-1  | time=2024-06-04T21:25:14.691Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error"
ollama-1  | INFO [main] build info | build=1 commit="5921b8f" tid="139987701374976" timestamp=1717536314
ollama-1  | INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139987701374976" timestamp=1717536314 total_threads=24
ollama-1  | INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="23" port="44605" tid="139987701374976" timestamp=1717536314
ollama-1  | llama_model_loader: loaded meta data with 38 key-value pairs and 377 tensors from /root/.ollama/models/blobs/sha256-1625f490246286ce0f4bd3fce83436c5209832a2730d23ea4330308003613f81 (version GGUF V3 (latest))
ollama-1  | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama-1  | llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
ollama-1  | llama_model_loader: - kv   1:                               general.name str              = DeepSeek-V2-Lite-Chat
ollama-1  | llama_model_loader: - kv   2:                      deepseek2.block_count u32              = 27
ollama-1  | llama_model_loader: - kv   3:                   deepseek2.context_length u32              = 163840
ollama-1  | llama_model_loader: - kv   4:                 deepseek2.embedding_length u32              = 2048
ollama-1  | llama_model_loader: - kv   5:              deepseek2.feed_forward_length u32              = 10944
ollama-1  | llama_model_loader: - kv   6:             deepseek2.attention.head_count u32              = 16
ollama-1  | llama_model_loader: - kv   7:          deepseek2.attention.head_count_kv u32              = 16
ollama-1  | llama_model_loader: - kv   8:                   deepseek2.rope.freq_base f32              = 10000.000000
ollama-1  | llama_model_loader: - kv   9: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
ollama-1  | llama_model_loader: - kv  10:                deepseek2.expert_used_count u32              = 6
ollama-1  | llama_model_loader: - kv  11:                          general.file_type u32              = 14
ollama-1  | llama_model_loader: - kv  12:        deepseek2.leading_dense_block_count u32              = 1
ollama-1  | llama_model_loader: - kv  13:                       deepseek2.vocab_size u32              = 102400
ollama-1  | llama_model_loader: - kv  14:           deepseek2.attention.kv_lora_rank u32              = 512
ollama-1  | llama_model_loader: - kv  15:             deepseek2.attention.key_length u32              = 192
ollama-1  | llama_model_loader: - kv  16:           deepseek2.attention.value_length u32              = 128
ollama-1  | llama_model_loader: - kv  17:       deepseek2.expert_feed_forward_length u32              = 1408
ollama-1  | llama_model_loader: - kv  18:                     deepseek2.expert_count u32              = 64
ollama-1  | llama_model_loader: - kv  19:              deepseek2.expert_shared_count u32              = 2
ollama-1  | llama_model_loader: - kv  20:             deepseek2.expert_weights_scale f32              = 1.000000
ollama-1  | llama_model_loader: - kv  21:             deepseek2.rope.dimension_count u32              = 64
ollama-1  | llama_model_loader: - kv  22:                deepseek2.rope.scaling.type str              = yarn
ollama-1  | llama_model_loader: - kv  23:              deepseek2.rope.scaling.factor f32              = 40.000000
ollama-1  | llama_model_loader: - kv  24: deepseek2.rope.scaling.original_context_length u32              = 4096
ollama-1  | llama_model_loader: - kv  25: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.070700
ollama-1  | llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
ollama-1  | llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = deepseek-llm
ollama-1  | llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,102400]  = ["!", "\"", "#", "$", "%", "&", "'", ...
ollama-1  | llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
ollama-1  | llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
ollama-1  | llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 100000
ollama-1  | llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 100001
ollama-1  | llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 100001
ollama-1  | llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
ollama-1  | llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false
ollama-1  | llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
ollama-1  | llama_model_loader: - kv  37:               general.quantization_version u32              = 2
ollama-1  | llama_model_loader: - type  f32:  108 tensors
ollama-1  | llama_model_loader: - type q5_0:   24 tensors
ollama-1  | llama_model_loader: - type q5_1:    3 tensors
ollama-1  | llama_model_loader: - type q4_K:  239 tensors
ollama-1  | llama_model_loader: - type q5_K:    2 tensors
ollama-1  | llama_model_loader: - type q6_K:    1 tensors
ollama-1  | llm_load_vocab: special tokens cache size = 2400
ollama-1  | time=2024-06-04T21:25:14.943Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server loading model"
ollama-1  | llm_load_vocab: token to piece cache size = 1.3318 MB
ollama-1  | llm_load_print_meta: format           = GGUF V3 (latest)
ollama-1  | llm_load_print_meta: arch             = deepseek2
ollama-1  | llm_load_print_meta: vocab type       = BPE
ollama-1  | llm_load_print_meta: n_vocab          = 102400
ollama-1  | llm_load_print_meta: n_merges         = 99757
ollama-1  | llm_load_print_meta: n_ctx_train      = 163840
ollama-1  | llm_load_print_meta: n_embd           = 2048
ollama-1  | llm_load_print_meta: n_head           = 16
ollama-1  | llm_load_print_meta: n_head_kv        = 16
ollama-1  | llm_load_print_meta: n_layer          = 27
ollama-1  | llm_load_print_meta: n_rot            = 64
ollama-1  | llm_load_print_meta: n_embd_head_k    = 192
ollama-1  | llm_load_print_meta: n_embd_head_v    = 128
ollama-1  | llm_load_print_meta: n_gqa            = 1
ollama-1  | llm_load_print_meta: n_embd_k_gqa     = 3072
ollama-1  | llm_load_print_meta: n_embd_v_gqa     = 2048
ollama-1  | llm_load_print_meta: f_norm_eps       = 0.0e+00
ollama-1  | llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
ollama-1  | llm_load_print_meta: f_clamp_kqv      = 0.0e+00
ollama-1  | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
ollama-1  | llm_load_print_meta: f_logit_scale    = 0.0e+00
ollama-1  | llm_load_print_meta: n_ff             = 10944
ollama-1  | llm_load_print_meta: n_expert         = 64
ollama-1  | llm_load_print_meta: n_expert_used    = 6
ollama-1  | llm_load_print_meta: causal attn      = 1
ollama-1  | llm_load_print_meta: pooling type     = 0
ollama-1  | llm_load_print_meta: rope type        = 0
ollama-1  | llm_load_print_meta: rope scaling     = yarn
ollama-1  | llm_load_print_meta: freq_base_train  = 10000.0
ollama-1  | llm_load_print_meta: freq_scale_train = 0.025
ollama-1  | llm_load_print_meta: n_yarn_orig_ctx  = 4096
ollama-1  | llm_load_print_meta: rope_finetuned   = unknown
ollama-1  | llm_load_print_meta: ssm_d_conv       = 0
ollama-1  | llm_load_print_meta: ssm_d_inner      = 0
ollama-1  | llm_load_print_meta: ssm_d_state      = 0
ollama-1  | llm_load_print_meta: ssm_dt_rank      = 0
ollama-1  | llm_load_print_meta: model type       = 16B
ollama-1  | llm_load_print_meta: model ftype      = Q4_K - Small
ollama-1  | llm_load_print_meta: model params     = 15.71 B
ollama-1  | llm_load_print_meta: model size       = 8.88 GiB (4.85 BPW) 
ollama-1  | llm_load_print_meta: general.name     = DeepSeek-V2-Lite-Chat
ollama-1  | llm_load_print_meta: BOS token        = 100000 '<|begin▁of▁sentence|>'
ollama-1  | llm_load_print_meta: EOS token        = 100001 '<|end▁of▁sentence|>'
ollama-1  | llm_load_print_meta: PAD token        = 100001 '<|end▁of▁sentence|>'
ollama-1  | llm_load_print_meta: LF token         = 126 'Ä'
ollama-1  | llm_load_print_meta: n_layer_dense_lead   = 1
ollama-1  | llm_load_print_meta: n_lora_q             = 0
ollama-1  | llm_load_print_meta: n_lora_kv            = 512
ollama-1  | llm_load_print_meta: n_ff_exp             = 1408
ollama-1  | llm_load_print_meta: n_expert_shared      = 2
ollama-1  | llm_load_print_meta: expert_weights_scale = 1.0
ollama-1  | llm_load_print_meta: rope_yarn_log_mul    = 0.0707
ollama-1  | ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ollama-1  | ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ollama-1  | ggml_cuda_init: found 1 CUDA devices:
ollama-1  |   Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
ollama-1  | llm_load_tensors: ggml ctx size =    0.35 MiB
ollama-1  | llm_load_tensors: offloading 26 repeating layers to GPU
ollama-1  | llm_load_tensors: offloaded 26/28 layers to GPU
ollama-1  | llm_load_tensors:        CPU buffer size =  2381.68 MiB
ollama-1  | llm_load_tensors:      CUDA0 buffer size =  8764.11 MiB
ollama-1  | llama_new_context_with_model: n_ctx      = 2048
ollama-1  | llama_new_context_with_model: n_batch    = 512
ollama-1  | llama_new_context_with_model: n_ubatch   = 512
ollama-1  | llama_new_context_with_model: flash_attn = 0
ollama-1  | llama_new_context_with_model: freq_base  = 10000.0
ollama-1  | llama_new_context_with_model: freq_scale = 0.025
ollama-1  | llama_kv_cache_init:  CUDA_Host KV buffer size =    20.00 MiB
ollama-1  | llama_kv_cache_init:      CUDA0 KV buffer size =   520.00 MiB
ollama-1  | llama_new_context_with_model: KV self size  =  540.00 MiB, K (f16):  324.00 MiB, V (f16):  216.00 MiB
ollama-1  | llama_new_context_with_model:  CUDA_Host  output buffer size =     0.40 MiB
ollama-1  | ggml_backend_cuda_buffer_type_alloc_buffer: allocating 376.06 MiB on device 0: cudaMalloc failed: out of memory
ollama-1  | ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 394330112
ollama-1  | llama_new_context_with_model: failed to allocate compute buffers
ollama-1  | llama_init_from_gpt_params: error: failed to create context with model '/root/.ollama/models/blobs/sha256-1625f490246286ce0f4bd3fce83436c5209832a2730d23ea4330308003613f81'
ollama-1  | ERROR [load_model] unable to load model | model="/root/.ollama/models/blobs/sha256-1625f490246286ce0f4bd3fce83436c5209832a2730d23ea4330308003613f81" tid="139987701374976" timestamp=1717536316
ollama-1  | terminate called without an active exception
ollama-1  | time=2024-06-04T21:25:16.818Z level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server not responding"
ollama-1  | time=2024-06-04T21:25:18.423Z level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/root/.ollama/models/blobs/sha256-1625f490246286ce0f4bd3fce83436c5209832a2730d23ea4330308003613f81'"
ollama-1  | [GIN] 2024/06/04 - 21:25:18 | 500 |  4.297364286s |       127.0.0.1 | POST     "/api/chat"
ollama-1  | time=2024-06-04T21:25:23.481Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.057297201
ollama-1  | time=2024-06-04T21:25:23.730Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.306755277
ollama-1  | time=2024-06-04T21:25:23.979Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.556203795

Tried varying the context window and quantization level, but it always fails with a cudaMalloc error.
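For what it's worth, the KV-cache numbers in the log above check out against the printed hyperparameters; what pushes the load over the edge is the extra ~376 MiB compute buffer that cudaMalloc fails on after 8764 MiB of weights are already resident. A small sketch (my reading of the log, not ollama code) reproducing the `llama_kv_cache_init` sizes for n_ctx=2048:

```python
# Reproduce the KV self-size llama.cpp prints for deepseek-v2:16b at n_ctx=2048,
# using the hyperparameters from the llm_load_print_meta lines above.
n_ctx        = 2048
n_layer      = 27
n_embd_k_gqa = 3072   # per-layer K width (n_embd_head_k=192 * n_head_kv=16)
n_embd_v_gqa = 2048   # per-layer V width (n_embd_head_v=128 * n_head_kv=16)
f16_bytes    = 2      # KV cache is stored in f16

k_mib = n_ctx * n_layer * n_embd_k_gqa * f16_bytes / 2**20
v_mib = n_ctx * n_layer * n_embd_v_gqa * f16_bytes / 2**20
print(k_mib, v_mib, k_mib + v_mib)   # 324.0 216.0 540.0, matching the log
```

So the KV cache itself is accounted for (540 MiB, split 520 MiB CUDA0 / 20 MiB host); the 376.06 MiB compute-buffer allocation on top of that is what doesn't fit in the ~9.3 GiB reported available.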


@rick-github commented on GitHub (Jun 4, 2024):

Let me amend my last statement: at least one quantization level (q4_0) works, the others fail:

$ for i in $(docker compose exec -it ollama ollama list | grep deepseek-v2:16b | awk '{print $1}') ; do echo $i ; docker compose exec -it ollama ollama run $i 'why is the sky blue?' ; done
deepseek-v2:16b-lite-chat-q4_K_M
Error: llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/root/.ollama/models/blobs/sha256-1fb74e7c7b5a6b355a7b9472b6d284e8375dfd9ead2159e7b5a8d07b6b3e390e'
deepseek-v2:16b-lite-chat-q4_K_S
Error: llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/root/.ollama/models/blobs/sha256-1625f490246286ce0f4bd3fce83436c5209832a2730d23ea4330308003613f81'
deepseek-v2:16b-lite-chat-q4_1
Error: llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/root/.ollama/models/blobs/sha256-c82ca62f3aece3d0f785cde7c173234c54be8672bca1b2562b91dd806ca3406b'
deepseek-v2:16b-lite-chat-q4_0
1. Scattering of light: The reason behind the color of the sky at daytime is due to a phenomenon called Rayleigh scattering, which occurs when sunlight passes through Earth's atmosphere. In this process, 
shorter wavelengths such as violet and blue are scattered more than longer ones like red and orange in all directions by air molecules or small particles present in the atmosphere. 

2. Sunlight refraction: As sunlight enters our atmosphere, it encounters numerous air molecules which refract (bend) its path slightly. This bending of light breaks up the continuous spectrum into smaller 
wavelengths due to which blue light appears more prominent compared with other colors during daytime when sun is higher and away from horizon. 

3. Sun's position: At sunrise or sunset, longer wavelength red and orange rays are scattered out of our line of sight by earth’s atmosphere because they don’t get bent much upon entering it; hence these 
become predominant making the sky appear reddish-orange instead during those times due to angle at which sun light reaches us.

4. Atmospheric Conditions: The scattering effect can change with different atmospheric conditions, such as when there are more dust particles or droplets in the air - for example near a forest fire or 
volcanic eruption – these lead to an increased amount of blue scattered light that makes even during day time skies look whitish-blue instead!

5. Color perception: Humans perceive color based on their evolutionary background and adaptation so our eyes have evolved sensitive cells especially designed to detect contrast, which is why we see the sky 
as predominantly blue rather than violet or some other colors.
  
In summary, although not entirely violet because of Rayleigh scattering's preference for shorter wavelengths (blue), it’s primarily sunlight interacting with molecules in atmosphere and how our eyes 
perceive these light interactions that gives us skies appearing mostly blue during daytime!

deepseek-v2:16b-lite-chat-q8_0
Error: llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/root/.ollama/models/blobs/sha256-458d8dbb5c64109623751f3c7e691f285770a6521bf06bf86172980b995b3bde'
deepseek-v2:16b-lite-chat-f16
Error: llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/root/.ollama/models/blobs/sha256-3ef9ac89c78b82e188cad8c75d0cb630899e08510a97d6fa41431139573f697c'

@rick-github commented on GitHub (Jun 4, 2024):

q4_0 is the only one that fits entirely in VRAM:

$ docker compose logs ollama | egrep "llm_load_tensors:.offload| model.*type"
ollama-1  | llm_load_print_meta: model type       = 16B
ollama-1  | llm_load_print_meta: model ftype      = Q4_K - Small
ollama-1  | llm_load_tensors: offloading 26 repeating layers to GPU
ollama-1  | llm_load_tensors: offloaded 26/28 layers to GPU
ollama-1  | llm_load_print_meta: model type       = 8B
ollama-1  | llm_load_print_meta: model ftype      = Q4_0
ollama-1  | llm_load_tensors: offloading 32 repeating layers to GPU
ollama-1  | llm_load_tensors: offloading non-repeating layers to GPU
ollama-1  | llm_load_tensors: offloaded 33/33 layers to GPU
ollama-1  | llm_load_print_meta: model type       = 16B
ollama-1  | llm_load_print_meta: model ftype      = Q4_K - Medium
ollama-1  | llm_load_tensors: offloading 24 repeating layers to GPU
ollama-1  | llm_load_tensors: offloaded 24/28 layers to GPU
ollama-1  | llm_load_print_meta: model type       = 16B
ollama-1  | llm_load_print_meta: model ftype      = Q4_K - Small
ollama-1  | llm_load_tensors: offloading 26 repeating layers to GPU
ollama-1  | llm_load_tensors: offloaded 26/28 layers to GPU
ollama-1  | llm_load_print_meta: model type       = 16B
ollama-1  | llm_load_print_meta: model ftype      = Q4_1
ollama-1  | llm_load_tensors: offloading 25 repeating layers to GPU
ollama-1  | llm_load_tensors: offloaded 25/28 layers to GPU
ollama-1  | llm_load_print_meta: model type       = 16B
ollama-1  | llm_load_print_meta: model ftype      = Q4_0
ollama-1  | llm_load_tensors: offloading 27 repeating layers to GPU
ollama-1  | llm_load_tensors: offloading non-repeating layers to GPU
ollama-1  | llm_load_tensors: offloaded 28/28 layers to GPU
ollama-1  | llm_load_print_meta: model type       = 16B
ollama-1  | llm_load_print_meta: model ftype      = Q8_0
ollama-1  | llm_load_tensors: offloading 15 repeating layers to GPU
ollama-1  | llm_load_tensors: offloaded 15/28 layers to GPU
ollama-1  | llm_load_print_meta: model type       = 16B
ollama-1  | llm_load_print_meta: model ftype      = F16
ollama-1  | llm_load_tensors: offloading 8 repeating layers to GPU
ollama-1  | llm_load_tensors: offloaded 8/28 layers to GPU
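A possible workaround I haven't fully verified: since the failures line up with partial offload (every failing variant stops short of 28/28 layers and then can't cudaMalloc the compute buffer), forcing a few fewer GPU layers than ollama's estimate should leave headroom. For example, a derived Modelfile (model name and layer count here are illustrative):

```
FROM deepseek-v2:16b-lite-chat-q4_K_S
PARAMETER num_gpu 20
```

followed by `ollama create deepseek-v2-partial -f Modelfile` and `ollama run deepseek-v2-partial`. Trading a couple of offloaded layers for compute-buffer headroom is slower but shouldn't abort.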

@rick-github commented on GitHub (Jun 4, 2024):

Killed off anything else using the GPU and deepseek q4_* now runs, after a fashion, but q8 and f16 still fail:

```
$ nvidia-smi
Tue Jun  4 23:54:10 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   56C    P8     9W / 200W |      3MiB / 12282MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
```
$ for i in $(docker compose exec -it ollama ollama list | grep deepseek-v2:16b | awk '{print $1}') ; do echo $i ; docker compose exec -it ollama ollama run $i 'why is the sky blue?' ; done
deepseek-v2:16b-lite-chat-q4_K_M
1. Scattering of sunlight: The primary reason behind a blue sky is the scattering of sunlight by Earth's atmosphere. When light from the sun enters the earth's atmosphere, it collides with molecules and
small particles present in the air such as nitrogen, oxygen, and other gases. These collisions cause the light to scatter or spread out in various directions.
2. Shortwave vs long wave radiation: Sunlight consists of different colors (wavelengths) that range from violet on the short wavelength end to red on the longer wavelength end. When sunlight enters Earth's
atmosphere, shorter-wavelength blue and violet light are scattered more than longer-wavelength reds, oranges, yellows, and greens due to their smaller wavelengths.
3. Human eye sensitivity: Our eyes have a higher sensitivity to green, yellow, and blue colors as compared to other hues, which is why we perceive these colors most prominently in the daylight hours. The
scattering of shorter wavelength light occurs more intensely than longer ones like red or orange; hence, it appears that our sky is predominantly blue.
4. Sun's position: Since sunlight consists mostly of white light with a mix of all visible wavelengths, the sun itself can appear white or slightly yellowish when viewed directly from space during daytime
hours due to its full spectrum. However, as sunlight travels through Earth's atmosphere and gets scattered by air molecules and tiny particles, it appears more blue-ish because our eyes are most sensitive
to this color compared to others.
In summary, the sky is blue primarily due to scattering of sunlight in the earth's atmosphere which scatters shorter wavelength colors (blue) much more than longer wavelengths (reds, oranges). This process
enhances the visual perception of a blue sky and helps regulate temperature and biological functions on Earth.

deepseek-v2:16b-lite-chat-q4_K_S
天空之所以呈现蓝色,是因为当阳光通过大气层时,它与气体分子和微粒发生散射。这种现象称为散射现象。在短波长中,如蓝光比红光更强烈地被散射。因此,当太阳高高挂在天空中时,它看起来是黄色的,而随着太阳的倾斜角度增
大,光线需要穿过更多的大气层,蓝光的散射增加,所以天空呈现出蓝色。这种现象称为瑞利散射(Rayleigh scattering)。

deepseek-v2:16b-lite-chat-q4_1
天空之所以呈现蓝色,是因为大气中的气体和微粒散射太阳光的视觉效果。当太阳或其他光源发出的白光进入大气层时,这些光线会遇到大气中存在的氧、氮等分子以及细小的尘埃和其他颗粒。

与直觉可能相反的是,最短的波长(紫色)的光被散射的程度比中等波长的波长(蓝色)的光要小。这是因为大气中的气体和微粒对不同颜色的光有不同的散射效应。当光线的波长较短时,它们更容易被这些气体和微粒直接反射或散射
,而不是像长波长的光那样被吸收后再重新发射。因此,相对于其他颜色,蓝色光在大气中被散射得更广泛、更强烈。

结果是,虽然太阳本身是白色的,但当光线通过大气层到达我们的眼睛时,它已经被散射的蓝色和其他颜色的光的混合物所影响。这就是我们看到的天空呈现出蓝色的原因。在日出和日落时分,太阳光线需要穿过更多的大气层才能达到
观察者的眼睛,这导致光线的更长波长的颜色(如红色和黄色)被散射得更少,而使我们看到的太阳显得更红。

总之,天空之所以是蓝色的,是由于大气中的气体和微粒对不同颜色的光的散射效应以及地球大气的光学性质共同作用的结果。

deepseek-v2:16b-lite-chat-q4_0
1. The phenomenon that you are referring to is known as Rayleigh scattering, which explains how sunlight travels through Earth's atmosphere and becomes scattered in all directions by the molecules present
within it. This causes different wavelengths of light (colors) to spread out over a greater distance due to the small size of their particles.

2. The shorter wavelength colors like violet and blue are scattered more than other colors, which is why we see them predominately during daytime hours when sunlight passes through Earth's atmosphere on
its way from the sun. This effect can change based on factors such as altitude, time of day (due to angle), weather conditions, and atmospheric particles present in the air that could scatter light
differently.

3. Another contributing factor is how human eyes are more sensitive to green wavelengths compared to other colors due likely to our evolutionary adaptation for better vision under natural outdoor lighting
conditions where green plants dominate habitats like forests or grasslands (though it might be surprising given many animals including humans have trichromatic color vision). So when we perceive blue sky,
our brain might not only favor the green channel but also adjust its sensitivity towards other wavelengths based on what's optimal for us under different environments.

4. The exact proportion of each wavelength scattered depends on factors like aerosol particles (like dust or pollution), cloud cover, and atmospheric pressure – which can vary throughout a day or changing
seasons resulting in subtle shifts between shades during sunrise/sunset times when Rayleigh scattering is weaker due to larger molecules present at these periods being dominant.

In summary, the blue sky occurs primarily because of Rayleigh scattering caused by sunlight passing through Earth's atmosphere where shorter wavelength colors (blue and violet) get scattered more than
others while our eyes are adapted for better color vision in natural outdoor lighting conditions favoring green wavelengths over other hues which together make up what we perceive as a predominantly blue
sky.

deepseek-v2:16b-lite-chat-q8_0
Error: llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/root/.ollama/models/blobs/sha256-458d8dbb5c64109623751f3c7e691f285770a6521bf06bf86172980b995b3bde'
deepseek-v2:16b-lite-chat-f16
Error: llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/root/.ollama/models/blobs/sha256-3ef9ac89c78b82e188cad8c75d0cb630899e08510a97d6fa41431139573f697c'
```

@rick-github commented on GitHub (Jun 4, 2024):

```
ollama-1  | llm_load_print_meta: model type       = 16B
ollama-1  | llm_load_print_meta: model ftype      = Q4_K - Medium
ollama-1  | llm_load_tensors: offloading 27 repeating layers to GPU
ollama-1  | llm_load_tensors: offloading non-repeating layers to GPU
ollama-1  | llm_load_tensors: offloaded 28/28 layers to GPU
ollama-1  | llm_load_print_meta: model type       = 16B
ollama-1  | llm_load_print_meta: model ftype      = Q4_K - Small
ollama-1  | llm_load_tensors: offloading 27 repeating layers to GPU
ollama-1  | llm_load_tensors: offloading non-repeating layers to GPU
ollama-1  | llm_load_tensors: offloaded 28/28 layers to GPU
ollama-1  | llm_load_print_meta: model type       = 16B
ollama-1  | llm_load_print_meta: model ftype      = Q4_1
ollama-1  | llm_load_tensors: offloading 27 repeating layers to GPU
ollama-1  | llm_load_tensors: offloading non-repeating layers to GPU
ollama-1  | llm_load_tensors: offloaded 28/28 layers to GPU
ollama-1  | llm_load_print_meta: model type       = 16B
ollama-1  | llm_load_print_meta: model ftype      = Q4_0
ollama-1  | llm_load_tensors: offloading 27 repeating layers to GPU
ollama-1  | llm_load_tensors: offloading non-repeating layers to GPU
ollama-1  | llm_load_tensors: offloaded 28/28 layers to GPU
ollama-1  | llm_load_print_meta: model type       = 16B
ollama-1  | llm_load_print_meta: model ftype      = Q8_0
ollama-1  | llm_load_tensors: offloading 19 repeating layers to GPU
ollama-1  | llm_load_tensors: offloaded 19/28 layers to GPU
ollama-1  | ggml_backend_cuda_buffer_type_alloc_buffer: allocating 424.50 MiB on device 0: cudaMalloc failed: out of memory
ollama-1  | llm_load_print_meta: model type       = 16B
ollama-1  | llm_load_print_meta: model ftype      = F16
ollama-1  | llm_load_tensors: offloading 10 repeating layers to GPU
ollama-1  | llm_load_tensors: offloaded 10/28 layers to GPU
ollama-1  | ggml_backend_cuda_buffer_type_alloc_buffer: allocating 612.00 MiB on device 0: cudaMalloc failed: out of memory
```

@rick-github commented on GitHub (Jun 4, 2024):

Ollama is perhaps too aggressive in setting `--n-gpu-layers`. For f16 it offloads 10 layers, and llama.cpp fails:

```
$ docker run --gpus all --shm-size="12gb" -v /media/download/ai/models:/models ghcr.io/ggerganov/llama.cpp:full-cuda -r -m /models/ollama/models/blobs/sha256-3ef9ac89c78b82e188cad8c75d0cb630899e08510a97d6fa41431139573f697c --ctx-size 2048 --batch-size 512 --parallel 1 --n-gpu-layers 10 -p 'why is the sky blue?'
...
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.35 MiB
llm_load_tensors: offloading 10 repeating layers to GPU
llm_load_tensors: offloaded 10/28 layers to GPU
llm_load_tensors:        CPU buffer size = 18806.81 MiB
llm_load_tensors:      CUDA0 buffer size = 11157.68 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:  CUDA_Host KV buffer size =   340.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   200.00 MiB
llama_new_context_with_model: KV self size  =  540.00 MiB, K (f16):  324.00 MiB, V (f16):  216.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.39 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 612.00 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 641728512
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model '/models/ollama/models/blobs/sha256-3ef9ac89c78b82e188cad8c75d0cb630899e08510a97d6fa41431139573f697c'
main: error: unable to load model
```

But using 9 layers works fine:

```
$ docker run --gpus all --shm-size="12gb" -v /media/download/ai/models:/models ghcr.io/ggerganov/llama.cpp:full-cuda -r -m /models/ollama/models/blobs/sha256-3ef9ac89c78b82e188cad8c75d0cb630899e08510a97d6fa41431139573f697c --ctx-size 2048 --batch-size 512 --parallel 1 --n-gpu-layers 9 -p 'why is the sky blue?'
...
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.35 MiB
llm_load_tensors: offloading 9 repeating layers to GPU
llm_load_tensors: offloaded 9/28 layers to GPU
llm_load_tensors:        CPU buffer size = 19922.57 MiB
llm_load_tensors:      CUDA0 buffer size = 10041.91 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:  CUDA_Host KV buffer size =   360.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   180.00 MiB
llama_new_context_with_model: KV self size  =  540.00 MiB, K (f16):  324.00 MiB, V (f16):  216.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.39 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   612.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    14.01 MiB
llama_new_context_with_model: graph nodes  = 1924
llama_new_context_with_model: graph splits = 288

system_info: n_threads = 8 / 24 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 1


why is the sky blue?
- The sky appears blue during the day because of a combination of many wavelengths of light emitted from the sun, absorbed and scattered by molecules and particles in the earth's atmosphere. The blue light is scattered more easily than other colors because of the size of the molecules in the atmosphere.
```
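The gap between the estimate (10 layers) and the observed maximum (9) suggests probing rather than trusting the estimate. A minimal sketch of bisecting the largest working layer count; `run_model` is a stub standing in for the docker invocation above, hard-wired to this GPU's observed limit of 9:

```shell
#!/bin/sh
# Bisect the largest --n-gpu-layers that loads. run_model is a stub:
# in real use it would run the llama.cpp docker command above and
# check its exit status; here it pretends 9 is the limit, as observed.
run_model() {
  [ "$1" -le 9 ]
}

lo=0; hi=28   # deepseek-v2:16b has 28 layers in total
while [ "$lo" -lt "$hi" ]; do
  mid=$(( (lo + hi + 1) / 2 ))
  if run_model "$mid"; then lo=$mid; else hi=$((mid - 1)); fi
done
echo "largest working --n-gpu-layers: $lo"
```

With a real `run_model`, this needs only about five load attempts instead of trying every count from 28 down.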


@rick-github commented on GitHub (Jun 4, 2024):

OK, this can be worked around by specifying the number of layers when loading the model: after the first failure, check the logs for the `--n-gpu-layers` value and pass a lower one via `num_gpu`:

```
$ curl -s http://ollama:11434/api/generate -d '{"model": "deepseek-v2:16b-lite-chat-q8_0", "prompt": "why is the sky blue?", "stream": false}'
{"error":"llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/root/.ollama/models/blobs/sha256-458d8dbb5c64109623751f3c7e691f285770a6521bf06bf86172980b995b3bde'"}

$ curl -s http://ollama:11434/api/generate -d '{"model": "deepseek-v2:16b-lite-chat-q8_0", "options": { "num_gpu": 18}, "prompt": "why is the sky blue?", "stream": false}'
{"model":"deepseek-v2:16b-lite-chat-q8_0","created_at":"2024-06-04T23:26:32.517418458Z","response":"1. Scattering of sunlight: The primary reason behind the sky's color is Rayleigh scattering
```
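The retry can be scripted. A sketch of the idea: `send` is a stub standing in for the curl call above, and here it pretends anything above 18 layers crashes (matching this card); the candidate layer counts are likewise assumptions:

```shell
#!/bin/sh
# Step num_gpu down until the runner stops aborting.
model="deepseek-v2:16b-lite-chat-q8_0"

payload() {  # $1 = num_gpu
  printf '{"model":"%s","options":{"num_gpu":%d},"prompt":"why is the sky blue?","stream":false}' "$model" "$1"
}

send() {  # stub for: curl -s http://ollama:11434/api/generate -d "$1"
  case "$1" in
    *'"num_gpu":19'*|*'"num_gpu":20'*)
      printf '{"error":"llama runner process has terminated"}' ;;
    *)
      printf '{"response":"1. Scattering of sunlight: ..."}' ;;
  esac
}

working=""
for n in 20 19 18 17 16; do
  # A body containing "error" means the runner crashed; step down.
  if ! send "$(payload "$n")" | grep -q '"error"'; then
    working=$n
    break
  fi
done
echo "first working num_gpu: $working"
```

Replacing the stub with the real curl call gives a crude but serviceable fallback loop.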

@rick-github commented on GitHub (Jun 4, 2024):

Alternatively, create a new model with a layer count:

```
$ ollama show --modelfile deepseek-v2:16b-lite-chat-q8_0 > Modelfile
$ echo "PARAMETER num_gpu 18" >> Modelfile
$ ollama create deepseek-v2:16b-lite-chat-n18-q8_0 -f Modelfile
transferring model data
using existing layer sha256:458d8dbb5c64109623751f3c7e691f285770a6521bf06bf86172980b995b3bde
creating new layer sha256:61dad5982e1a6b8f1d301ee54789ac05be9800463191743fb6c1a8cbf40c18c6
creating new layer sha256:ab0697c503d3fe44bfab65c17dcbb464c44f819d839343e3ee8c104130ffec2a
creating new layer sha256:6f29ca49a8a8fdf0638ac5c2d4b5f5ad10f1f04651999c3a8d78a60c0b83591a
creating new layer sha256:86bc773897595edc3bf202ad55cc32bf05d99e0032cff50423f121281ce37256
writing manifest
success
```

```
$ curl -s http://ollama:11434/api/generate -d '{"model": "deepseek-v2:16b-lite-chat-n18-q8_0", "prompt": "why is the sky blue?", "stream": false}'
{"model":"deepseek-v2:16b-lite-chat-n18-q8_0","created_at":"2024-06-04T23:45:04.03408641Z","response":"1. The color of the sky at midday, when observed on a clear day, appears p
```

But I imagine this will still cause problems when something else is also using the GPU.
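The Modelfile step can be factored out so the pinned file is inspectable before `ollama create` runs. A sketch; the `FROM` line here stands in for the real output of `ollama show --modelfile`, and the model name and layer count are taken from the example above:

```shell
#!/bin/sh
# Build a layer-pinned Modelfile. In real use, base_modelfile comes from:
#   ollama show --modelfile deepseek-v2:16b-lite-chat-q8_0
layers=18
base_modelfile="FROM deepseek-v2:16b-lite-chat-q8_0"

# Append the num_gpu pin to the existing modelfile text.
pinned=$(printf '%s\nPARAMETER num_gpu %s\n' "$base_modelfile" "$layers")
printf '%s\n' "$pinned" > Modelfile
# Then: ollama create "deepseek-v2:16b-lite-chat-n${layers}-q8_0" -f Modelfile
cat Modelfile
```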


@rick-github commented on GitHub (Jun 5, 2024):

Another mitigation might be to enable flash attention (`OLLAMA_FLASH_ATTENTION=1` in the environment), as this reduces the size of the CUDA0 compute buffer, which is where the cudaMalloc failure occurs. Unfortunately deepseek-v2 uses different dimensions for its query and value vectors, which causes llama.cpp to fail an assert:

```
GGML_ASSERT: ggml.c:5715: ggml_nelements(a) == ne0*ne1
```

This is an open issue with llama.cpp (https://github.com/ggerganov/llama.cpp/issues/7343), so flash attention may become usable for this model in the future.


@DirtyKnightForVi commented on GitHub (Jun 6, 2024):

I am using version 0.1.41 and have set `OLLAMA_FLASH_ATTENTION=1`. For `num_ctx`, I even set it to 9 (4xA100), but it still doesn't work. It gives an error when running `ollama run`:

```
Error: llama runner process has terminated: signal: aborted (core dumped)
```
<!-- gh-comment-id:2151371871 --> @DirtyKnightForVi commented on GitHub (Jun 6, 2024):

I am using version 0.1.41 and have set `OLLAMA_FLASH_ATTENTION=1`. For `num_ctx`, I even set it to 9 (4xA100), but it still doesn't work. It gives an error when running `ollama run`:

```
Error: llama runner process has terminated: signal: aborted (core dumped)
```

<!-- gh-comment-id:2152374840 --> @rick-github commented on GitHub (Jun 6, 2024):

Setting `OLLAMA_FLASH_ATTENTION` won't help until #7343 is fixed.

Set `num_gpu`, not `num_ctx`.
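
For reference, `num_gpu` can also be passed per request through the `options` field of the Ollama generate API instead of baking it into a Modelfile. A minimal sketch of such a payload (model name and layer count taken from this thread; the server URL is illustrative):

```python
import json

# Per-request options override Modelfile PARAMETER values for this call only.
payload = {
    "model": "deepseek-v2:16b-lite-chat-q8_0",
    "prompt": "why is the sky blue?",
    "stream": False,
    "options": {"num_gpu": 18},  # offload 18 layers; tune for your VRAM
}
body = json.dumps(payload)
print(body)
# Send with e.g.: curl -s http://localhost:11434/api/generate -d "$body"
```

This avoids creating a new model tag just to experiment with different layer counts.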

<!-- gh-comment-id:2152425706 --> @DirtyKnightForVi commented on GitHub (Jun 6, 2024):

My bad. I did set `num_gpu`, but it still doesn't work. You can see I set it to 9.

<!-- gh-comment-id:2152460926 --> @rick-github commented on GitHub (Jun 6, 2024):

In your logs, what does ollama set `--n-gpu-layers` to when you don't override it? Does it fail with `cudaMalloc failed` or some other error? Does it fail while allocating the `CUDA0 compute buffer` or at some different point during the model load?

<!-- gh-comment-id:2154214403 --> @chigkim commented on GitHub (Jun 7, 2024):

I'm also having a problem running deepseek-v2.

`ollama run deepseek-v2` gives "Error: llama runner process has terminated: signal: abort trap."

I'm running Ollama v0.1.41 on an M3 Max with 64 GB, and all my other models work fine.

<!-- gh-comment-id:2154893082 --> @rick-github commented on GitHub (Jun 7, 2024):

Unlikely to be a number-of-GPU-layers problem since you are using a Mac. What do the logs show?

<!-- gh-comment-id:2155086691 --> @DirtyKnightForVi commented on GitHub (Jun 7, 2024):

It is strange that it works well now... I did nothing to any other parameters.

But I've run into a new problem: [Parameters Override Like Llama.cpp](https://github.com/ollama/ollama/issues/4904)

Could you help with that one?

<!-- gh-comment-id:2155151025 --> @rick-github commented on GitHub (Jun 7, 2024):

As far as I know, there's no way to pass KV overrides through ollama. The usual way to do this would be to create a quant with the required values.

<!-- gh-comment-id:2155721421 --> @DirtyKnightForVi commented on GitHub (Jun 8, 2024):

The current issue is that the parameters do exist, but the types seem to be incorrect. Running with llama.cpp gives, for example, `wrong type i32 but expected type u32`.
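
For background on that error: GGUF stores each metadata key with an explicit value-type tag, and loaders check the tag strictly, so a value written as `i32` is rejected where `u32` is expected even when the number itself would fit in both. A toy sketch of that check, using the unsigned/signed 32-bit type codes as I read them from the GGUF specification (verify against the current spec):

```python
# GGUF metadata value-type tags (subset), per my reading of the GGUF spec.
GGUF_TYPE_UINT32 = 4
GGUF_TYPE_INT32 = 5

def check_kv_type(expected, actual):
    """Mimic a strict loader: the type tag must match exactly, even if the
    numeric value would be representable in both types."""
    return "ok" if expected == actual else "wrong type"

print(check_kv_type(GGUF_TYPE_UINT32, GGUF_TYPE_INT32))  # wrong type
```

This is why such mismatches usually need the GGUF metadata to be rewritten with the correct type rather than just a different value.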

<!-- gh-comment-id:2158782682 --> @chigkim commented on GitHub (Jun 10, 2024):

> Unlikely to be a number of gpu layers problem since you are using a Mac. What do the logs show?

Here is my log.
[error.txt](https://github.com/user-attachments/files/15776093/error.txt)

<!-- gh-comment-id:2158829353 --> @rick-github commented on GitHub (Jun 10, 2024):

You are loading the model with `OLLAMA_FLASH_ATTENTION=1`, which is not supported for deepseek-v2, causing it to assert fail:

```
GGML_ASSERT: /Users/runner/work/ollama/ollama/llm/llama.cpp/ggml.c:5714: ggml_nelements(a) == ne0*ne1
```

Flash attention with deepseek-v2 models won't work until https://github.com/ggerganov/llama.cpp/issues/7343 is resolved.

Disabling flash attention should improve your ability to load the model.
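
To illustrate why this model is special: the flash-attention path at the time assumed K and V share a head dimension, while deepseek-v2's differ (roughly 192 for K vs 128 for V in its MLA layout; treat the exact numbers as approximate). A toy version of that precondition:

```python
def flash_attention_supported(head_dim_k, head_dim_v):
    """Sketch of the FA precondition: K and V head dimensions must match,
    otherwise the reshape behind the GGML_ASSERT above cannot hold."""
    return head_dim_k == head_dim_v

# Typical llama-style model: K and V head dims match, FA is fine.
print(flash_attention_supported(128, 128))  # True
# deepseek-v2 (MLA): mismatched head dims, so the assert trips.
print(flash_attention_supported(192, 128))  # False
```

Most other architectures pass this check, which is why only deepseek-v2 aborts under `OLLAMA_FLASH_ATTENTION=1`.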

<!-- gh-comment-id:2160246066 --> @chigkim commented on GitHub (Jun 11, 2024):

Ah ok. Thanks for the info! I guess I'll just take it out of the service until FA is supported like all other models I use.

Interestingly, I tried running with FA, but it only answered in Chinese even though I asked in English. :(
Reference: github-starred/ollama#65065