[GH-ISSUE #5522] deepseek-coder-v2:236b - Error: llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/usr/share/ollama/...path/to/blob #65486

Open
opened 2026-05-03 21:27:28 -05:00 by GiteaMirror · 14 comments

Originally created by @scouzi1966 on GitHub (Jul 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5522

What is the issue?

I've had this issue for a while, with earlier versions of ollama and the latest, on an Intel SPR 8480+ with an RTX 4090. The num_gpu parameter has been removed from the model file, so I can no longer reduce the number of layers sent to the GPU. It sends 10, and I can't test with 9, 8, etc. I can run all other models without any issue.

I have 24 GB of VRAM on my 4090 (nothing else loaded) and 320 GB of main memory. Ubuntu 22.04, Nvidia driver 550.54.14, CUDA 12.4.
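For reference, a hedged sketch of forcing fewer offloaded layers at request time, assuming the running build still honors the num_gpu option in the API options map (this is version-dependent; the value 9 is just an example):

```
# Per-request via the API (num_gpu = number of layers to offload):
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder-v2:236b",
  "prompt": "hello",
  "options": { "num_gpu": 9 }
}'

# Or interactively inside `ollama run`:
#   /set parameter num_gpu 9
```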

Jul 06 19:23:15 ubuntux ollama[169742]: [GIN] 2024/07/06 - 19:23:15 | 200 | 64.872µs | 127.0.0.1 | HEAD "/"
Jul 06 19:23:15 ubuntux ollama[169742]: [GIN] 2024/07/06 - 19:23:15 | 200 | 18.056618ms | 127.0.0.1 | POST "/api/show"
Jul 06 19:23:15 ubuntux ollama[169742]: time=2024-07-06T19:23:15.282-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=61 layers.offload=10 layers.split="" memory.available="[23.3 GiB]" memory.required.full="134.5 GiB" memory.required.partial="22.1 GiB" memory.required.kv="9.4 GiB" memory.required.allocations="[22.1 GiB]" memory.weights.total="132.5 GiB" memory.weights.repeating="132.1 GiB" memory.weights.nonrepeating="410.2 MiB" memory.graph.full="642.0 MiB" memory.graph.partial="891.5 MiB"
Jul 06 19:23:15 ubuntux ollama[169742]: time=2024-07-06T19:23:15.283-04:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama1660031732/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 10 --parallel 1 --port 43475"
Jul 06 19:23:15 ubuntux ollama[169742]: time=2024-07-06T19:23:15.284-04:00 level=INFO source=sched.go:382 msg="loaded runners" count=1
Jul 06 19:23:15 ubuntux ollama[169742]: time=2024-07-06T19:23:15.284-04:00 level=INFO source=server.go:556 msg="waiting for llama runner to start responding"
Jul 06 19:23:15 ubuntux ollama[169742]: time=2024-07-06T19:23:15.284-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error"
Jul 06 19:23:15 ubuntux ollama[584320]: INFO [main] build info | build=1 commit="7c26775" tid="140507113369600" timestamp=1720308195
Jul 06 19:23:15 ubuntux ollama[584320]: INFO [main] system info | n_threads=56 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140507113369600" timestamp=1720308195 total_threads=112
Jul 06 19:23:15 ubuntux ollama[584320]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="111" port="43475" tid="140507113369600" timestamp=1720308195
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: loaded meta data with 39 key-value pairs and 959 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408 (version GGUF V3 (latest))
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 0: general.architecture str = deepseek2
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 1: general.name str = DeepSeek-Coder-V2-Instruct
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 2: deepseek2.block_count u32 = 60
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 3: deepseek2.context_length u32 = 163840
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 4: deepseek2.embedding_length u32 = 5120
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 5: deepseek2.feed_forward_length u32 = 12288
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 6: deepseek2.attention.head_count u32 = 128
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 7: deepseek2.attention.head_count_kv u32 = 128
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 8: deepseek2.rope.freq_base f32 = 10000.000000
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 9: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 10: deepseek2.expert_used_count u32 = 6
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 11: general.file_type u32 = 2
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 12: deepseek2.leading_dense_block_count u32 = 1
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 13: deepseek2.vocab_size u32 = 102400
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 14: deepseek2.attention.q_lora_rank u32 = 1536
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 15: deepseek2.attention.kv_lora_rank u32 = 512
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 16: deepseek2.attention.key_length u32 = 192
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 17: deepseek2.attention.value_length u32 = 128
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 18: deepseek2.expert_feed_forward_length u32 = 1536
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 19: deepseek2.expert_count u32 = 160
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 20: deepseek2.expert_shared_count u32 = 2
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 21: deepseek2.expert_weights_scale f32 = 16.000000
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 22: deepseek2.rope.dimension_count u32 = 64
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 23: deepseek2.rope.scaling.type str = yarn
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 24: deepseek2.rope.scaling.factor f32 = 40.000000
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 25: deepseek2.rope.scaling.original_context_length u32 = 4096
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 26: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 28: tokenizer.ggml.pre str = deepseek-llm
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,102400] = ["!", "\"", "#", "$", "%", "&", "'", ...
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,102400] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,99757] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 100000
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 100001
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 100001
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 37: tokenizer.chat_template str = {% if not add_generation_prompt is de...
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 38: general.quantization_version u32 = 2
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - type f32: 300 tensors
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - type q4_0: 658 tensors
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - type q6_K: 1 tensors
Jul 06 19:23:15 ubuntux ollama[169742]: time=2024-07-06T19:23:15.536-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model"
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_vocab: special tokens cache size = 2400
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_vocab: token to piece cache size = 0.6661 MB
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: format = GGUF V3 (latest)
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: arch = deepseek2
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: vocab type = BPE
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_vocab = 102400
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_merges = 99757
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_ctx_train = 163840
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_embd = 5120
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_head = 128
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_head_kv = 128
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_layer = 60
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_rot = 64
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_embd_head_k = 192
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_embd_head_v = 128
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_gqa = 1
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_embd_k_gqa = 24576
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_embd_v_gqa = 16384
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: f_norm_eps = 0.0e+00
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: f_logit_scale = 0.0e+00
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_ff = 12288
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_expert = 160
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_expert_used = 6
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: causal attn = 1
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: pooling type = 0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: rope type = 0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: rope scaling = yarn
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: freq_base_train = 10000.0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: freq_scale_train = 0.025
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_ctx_orig_yarn = 4096
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: rope_finetuned = unknown
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: ssm_d_conv = 0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: ssm_d_inner = 0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: ssm_d_state = 0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: ssm_dt_rank = 0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: model type = 236B
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: model ftype = Q4_0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: model params = 235.74 B
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: model size = 123.78 GiB (4.51 BPW)
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: general.name = DeepSeek-Coder-V2-Instruct
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: BOS token = 100000 '<|begin▁of▁sentence|>'
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: EOS token = 100001 '<|end▁of▁sentence|>'
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: PAD token = 100001 '<|end▁of▁sentence|>'
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: LF token = 126 'Ä'
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_layer_dense_lead = 1
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_lora_q = 1536
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_lora_kv = 512
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_ff_exp = 1536
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_expert_shared = 2
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: expert_weights_scale = 16.0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: rope_yarn_log_mul = 0.1000
Jul 06 19:23:15 ubuntux ollama[169742]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
Jul 06 19:23:15 ubuntux ollama[169742]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
Jul 06 19:23:15 ubuntux ollama[169742]: ggml_cuda_init: found 1 CUDA devices:
Jul 06 19:23:15 ubuntux ollama[169742]: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_tensors: ggml ctx size = 0.87 MiB
Jul 06 19:23:16 ubuntux ollama[169742]: time=2024-07-06T19:23:16.992-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server not responding"
Jul 06 19:23:19 ubuntux ollama[169742]: time=2024-07-06T19:23:19.191-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model"
Jul 06 19:23:19 ubuntux ollama[169742]: llm_load_tensors: offloading 10 repeating layers to GPU
Jul 06 19:23:19 ubuntux ollama[169742]: llm_load_tensors: offloaded 10/61 layers to GPU
Jul 06 19:23:19 ubuntux ollama[169742]: llm_load_tensors: CPU buffer size = 105416.00 MiB
Jul 06 19:23:19 ubuntux ollama[169742]: llm_load_tensors: CUDA0 buffer size = 21335.35 MiB
Jul 06 19:23:21 ubuntux ollama[169742]: llama_new_context_with_model: n_ctx = 2048
Jul 06 19:23:21 ubuntux ollama[169742]: llama_new_context_with_model: n_batch = 512
Jul 06 19:23:21 ubuntux ollama[169742]: llama_new_context_with_model: n_ubatch = 512
Jul 06 19:23:21 ubuntux ollama[169742]: llama_new_context_with_model: flash_attn = 0
Jul 06 19:23:21 ubuntux ollama[169742]: llama_new_context_with_model: freq_base = 10000.0
Jul 06 19:23:21 ubuntux ollama[169742]: llama_new_context_with_model: freq_scale = 0.025
Jul 06 19:23:21 ubuntux ollama[169742]: time=2024-07-06T19:23:21.904-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server not responding"
Jul 06 19:23:23 ubuntux ollama[169742]: time=2024-07-06T19:23:23.601-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model"
Jul 06 19:23:24 ubuntux ollama[169742]: llama_kv_cache_init: CUDA_Host KV buffer size = 8000.00 MiB
Jul 06 19:23:24 ubuntux ollama[169742]: llama_kv_cache_init: CUDA0 KV buffer size = 1600.00 MiB
Jul 06 19:23:24 ubuntux ollama[169742]: llama_new_context_with_model: KV self size = 9600.00 MiB, K (f16): 5760.00 MiB, V (f16): 3840.00 MiB
Jul 06 19:23:24 ubuntux ollama[169742]: llama_new_context_with_model: CUDA_Host output buffer size = 0.41 MiB
Jul 06 19:23:24 ubuntux ollama[169742]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 842.00 MiB on device 0: cudaMalloc failed: out of memory
Jul 06 19:23:24 ubuntux ollama[169742]: ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 882903040
Jul 06 19:23:24 ubuntux ollama[169742]: llama_new_context_with_model: failed to allocate compute buffers
Jul 06 19:23:25 ubuntux ollama[169742]: llama_init_from_gpt_params: error: failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408'
Jul 06 19:23:26 ubuntux ollama[169742]: time=2024-07-06T19:23:26.314-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server not responding"
Jul 06 19:23:26 ubuntux ollama[584320]: ERROR [load_model] unable to load model | model="/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408" tid="140507113369600" timestamp=1720308206
Jul 06 19:23:26 ubuntux ollama[169742]: terminate called without an active exception
Jul 06 19:23:26 ubuntux ollama[169742]: time=2024-07-06T19:23:26.566-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error"
Jul 06 19:23:26 ubuntux ollama[169742]: time=2024-07-06T19:23:26.817-04:00 level=ERROR source=sched.go:388 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408'"
Jul 06 19:23:26 ubuntux ollama[169742]: [GIN] 2024/07/06 - 19:23:26 | 500 | 11.766669885s | 127.0.0.1 | POST "/api/chat"
Jul 06 19:23:31 ubuntux ollama[169742]: time=2024-07-06T19:23:31.944-04:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.127181114 model=/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408
Jul 06 19:23:32 ubuntux ollama[169742]: time=2024-07-06T19:23:32.194-04:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.376988446 model=/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408
Jul 06 19:23:32 ubuntux ollama[169742]: time=2024-07-06T19:23:32.444-04:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.626674402 model=/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408
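A rough back-of-envelope sum of the CUDA0 allocations reported above (not the scheduler's own accounting) shows how tight the margin is:

```
# CUDA0 weights buffer   21335.35 MiB
# CUDA0 KV buffer         1600.00 MiB
# compute buffer           842.00 MiB  <- the allocation that failed
echo $(( 21335 + 1600 + 842 ))   # 23777 MiB, about 23.2 GiB
# vs. the 23.3 GiB reported available -- essentially nothing left for the
# CUDA context itself (typically a few hundred MiB), hence the cudaMalloc OOM.
```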

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.48

GiteaMirror added the bug label 2026-05-03 21:27:28 -05:00

@olumolu commented on GitHub (Jul 8, 2024):

I have tried the 16b on Alma Linux with a Xeon processor, one motherboard, and 16 GB of main memory; I could run deepseek-v2 16b nicely.


@scouzi1966 commented on GitHub (Jul 8, 2024):

> I have tried the 16b on Alma Linux with a Xeon processor, one motherboard, and 16 GB of main memory; I could run deepseek-v2 16b nicely.

My issue is with the 236b model. Quite a large difference from the 16b.


@Ramzee-S commented on GitHub (Jul 13, 2024):

Sorry, not of much help, but I have a similar issue. When I disable my GPUs (2x RTX 3090), I can run the model in main memory (512 GB, 16 channels x 32 GB) at a tolerable speed (2x Xeon 8470), although initial prompt processing takes a while. However, when the GPUs are enabled, I get the error:
ollama run deepseek-coder-v2:236b
Error: llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408'
Other models that won't fit in VRAM run fine, partially in VRAM and partially in RAM, but this one does not seem to. Ollama version is 0.1.47, with enough local disk space too.
Any help would be appreciated.


@scouzi1966 commented on GitHub (Jul 15, 2024):

> Sorry, not of much help, but I have a similar issue. When I disable my GPUs (2x RTX 3090), I can run the model in main memory (512 GB, 16 channels x 32 GB) at a tolerable speed (2x Xeon 8470), although initial prompt processing takes a while. However, when the GPUs are enabled, I get the error:
> ollama run deepseek-coder-v2:236b
> Error: llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408'
> Other models that won't fit in VRAM run fine, partially in VRAM and partially in RAM, but this one does not seem to. Ollama version is 0.1.47, with enough local disk space too.
> Any help would be appreciated.

How do you get Ollama to ignore your GPU? Or how do you disable it on Linux?
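(For later readers: one commonly documented approach, per the Ollama GPU docs, is to hide the CUDA devices from the server with an invalid GPU ID; behavior may vary by version, so treat this as a sketch.)

```
# Stop the system service, then start a server that cannot see any GPU:
sudo systemctl stop ollama
CUDA_VISIBLE_DEVICES=-1 ollama serve
```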


@Ramzee-S commented on GitHub (Jul 15, 2024):

> How do you get Ollama to ignore your GPU? Or how do you disable it on Linux?

By accident my Ubuntu Nvidia driver got updated a few days ago, and I needed a reboot to get the Nvidia drivers working again. When I ran nvidia-smi, nothing showed up except a message that versions did not match.
Then I tried the deepseek-coder-v2 model and it ran! It was actually quite good. After the reboot it stopped working. I replicated this by removing the physical GPUs, and then the model works; putting the GPUs back in, it of course did not work again. I have not yet found a way to disable the GPUs temporarily via environment variables, so I have no easy fix and am using other models until maybe a new version of ollama will run this model.

Edit: OK, found a fix here: https://unix.stackexchange.com/questions/654075/how-can-i-disable-and-later-re-enable-one-of-my-nvidia-gpus
You can disable the GPUs on Linux (works on Ubuntu 22). Determine the device ID using nvidia-smi, or better:

```
lspci | grep NVIDIA
```

Then, to deactivate, use the device ID in place of the '0000:xx:00.0' string in the following:

```
nvidia-smi -i 0000:xx:00.0 -pm 0
nvidia-smi drain -p 0000:xx:00.0 -m 1
```

To activate, use:

```
nvidia-smi drain -p 0000:xx:00.0 -m 0
```

again replacing the ID.
Now deepseek is working again for me. But of course partial GPU offloading would be nicer, because, just as an example:
total duration: 3m24.609638164s
load duration: 46.016846ms
prompt eval count: 424 token(s)
prompt eval duration: 26.198928s
prompt eval rate: 16.18 tokens/s
eval count: 525 token(s)
eval duration: 2m58.207292s
eval rate: 2.95 tokens/s


@dhiltgen commented on GitHub (Jul 24, 2024):

> Trying to achieve a way to disable the GPU temporarily by some environment variables

You can use OLLAMA_LLM_LIBRARY to force a CPU-based runner (e.g. OLLAMA_LLM_LIBRARY=cpu_avx2).

I've also posted PR #5922 to add a new GPU overhead setting, to bring back a viable workaround for when the memory predictions are incorrect.
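A hedged sketch of making such settings persistent on a standard Linux install, where ollama runs as a systemd service (the variable name is as documented; cpu_avx2 is the example value from the comment above):

```
# Open a systemd override for the service:
sudo systemctl edit ollama.service
# Add under [Service]:
#   Environment="OLLAMA_LLM_LIBRARY=cpu_avx2"
# Then reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
```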


@gsoul commented on GitHub (Aug 27, 2024):

I'm experiencing the same issue as the topic starter. Has anyone figured out a workaround yet?


@pftg commented on GitHub (Sep 10, 2024):

The same issue occurs on macOS.


@gsoul commented on GitHub (Sep 11, 2024):

It's fixed for me now. @pftg, try upgrading to the latest version and playing around with the OLLAMA_GPU_OVERHEAD env parameter: https://github.com/ollama/ollama/pull/5922
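A hedged usage sketch, assuming a build that includes PR #5922 (the variable takes a per-GPU reservation in bytes; 2 GiB here is just an example value):

```
# Reserve extra VRAM headroom before starting the server:
sudo systemctl stop ollama
OLLAMA_GPU_OVERHEAD=$((2 * 1024 * 1024 * 1024)) ollama serve
```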


@pftg commented on GitHub (Sep 11, 2024):

@gsoul I tried OLLAMA_GPU_OVERHEAD for deepseek-v2:236b and still had no success. I'm using the default version for now; I believe my laptop has insufficient memory for the larger version.


@olumolu commented on GitHub (Sep 11, 2024):

> @gsoul I tried OLLAMA_GPU_OVERHEAD for deepseek-v2:236b and still had no success. I'm using the default version for now; I believe my laptop has insufficient memory for the larger version.

How much RAM do you have?


@pftg commented on GitHub (Sep 11, 2024):

@olumolu 16GB


@olumolu commented on GitHub (Sep 11, 2024):

> @olumolu 16GB

No, with 16 GB you can't even run gemma 27b.
128 GB of RAM minimum (on Arch without a desktop environment, with zswap, or with VRAM) can barely run that model; 160-196 GB of RAM is recommended to run it.


@Ramzee-S commented on GitHub (Sep 11, 2024):

I am quite sure that with 16 GB of RAM or VRAM you won't be able to run deepseek 236b in any productive way (the model size in RAM is 133 GB, and even a big, fast SSD swap will be too slow). The issues above also occurred with 512 GB of RAM and 2x RTX 3090 (48 GB VRAM total), and for users with 196 GB of RAM.
The issues above also seem related to this thread on llama.cpp:
https://github.com/ggerganov/llama.cpp/discussions/8520
Running deepseek 236b in llama.cpp directly also gave issues, and I think these were related to some of the ollama issues people are experiencing. When running llama.cpp, by default a fixed fraction of the model size was used as a context size multiplier, which resulted in very high memory allocation when loading/starting the model and failed in some cases. Basically, you could have enough memory for the model but not for the default allocated context size. If a smaller context window was manually allocated, things worked with llama.cpp. I would not be surprised if lower default context settings in ollama fixed some of these issues. However, the other symptom, things working without the GPU but not with it, seems to be a separate issue.
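In Ollama terms, the corresponding mitigation would be to cap the context window so the KV cache and compute buffers shrink; a hedged sketch using the standard num_ctx option (the value 1024 is just an example, halving the 2048 default seen in the log above):

```
# Interactively:
#   ollama run deepseek-coder-v2:236b
#   >>> /set parameter num_ctx 1024

# Or per-request via the API:
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder-v2:236b",
  "prompt": "hello",
  "options": { "num_ctx": 1024 }
}'
```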

Reference: github-starred/ollama#65486