[GH-ISSUE #5522] deepseek-coder-v2:236b - Error: llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/usr/share/ollama/...path/to/blob #65486

Open
opened 2026-05-03 21:27:28 -05:00 by GiteaMirror · 14 comments

Originally created by @scouzi1966 on GitHub (Jul 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5522

What is the issue?

I've had this issue for a while, with earlier versions of ollama and the latest, on an Intel SPR 8480+ with an RTX 4090. The num_gpu parameter has been removed from the model file, so I can no longer reduce the number of layers sent to the GPU. It sends 10, and I can't test with 9, 8, etc. I can run all other models without any issue.

I have 24 GB of VRAM on my 4090 (nothing else loaded) and 320 GB of main memory. Ubuntu 22.04, Nvidia driver 550.54.14, CUDA 12.4.
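For reference, a hedged sketch of forcing fewer offloaded layers at request time, assuming the running build still honors the num_gpu option in the API options map (this is version-dependent; the value 9 is just an example):

```
# Per-request via the API (num_gpu = number of layers to offload):
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder-v2:236b",
  "prompt": "hello",
  "options": { "num_gpu": 9 }
}'

# Or interactively inside `ollama run`:
#   /set parameter num_gpu 9
```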

Jul 06 19:23:15 ubuntux ollama[169742]: [GIN] 2024/07/06 - 19:23:15 | 200 | 64.872µs | 127.0.0.1 | HEAD "/"
Jul 06 19:23:15 ubuntux ollama[169742]: [GIN] 2024/07/06 - 19:23:15 | 200 | 18.056618ms | 127.0.0.1 | POST "/api/show"
Jul 06 19:23:15 ubuntux ollama[169742]: time=2024-07-06T19:23:15.282-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=61 layers.offload=10 layers.split="" memory.available="[23.3 GiB]" memory.required.full="134.5 GiB" memory.required.partial="22.1 GiB" memory.required.kv="9.4 GiB" memory.required.allocations="[22.1 GiB]" memory.weights.total="132.5 GiB" memory.weights.repeating="132.1 GiB" memory.weights.nonrepeating="410.2 MiB" memory.graph.full="642.0 MiB" memory.graph.partial="891.5 MiB"
Jul 06 19:23:15 ubuntux ollama[169742]: time=2024-07-06T19:23:15.283-04:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama1660031732/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 10 --parallel 1 --port 43475"
Jul 06 19:23:15 ubuntux ollama[169742]: time=2024-07-06T19:23:15.284-04:00 level=INFO source=sched.go:382 msg="loaded runners" count=1
Jul 06 19:23:15 ubuntux ollama[169742]: time=2024-07-06T19:23:15.284-04:00 level=INFO source=server.go:556 msg="waiting for llama runner to start responding"
Jul 06 19:23:15 ubuntux ollama[169742]: time=2024-07-06T19:23:15.284-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error"
Jul 06 19:23:15 ubuntux ollama[584320]: INFO [main] build info | build=1 commit="7c26775" tid="140507113369600" timestamp=1720308195
Jul 06 19:23:15 ubuntux ollama[584320]: INFO [main] system info | n_threads=56 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140507113369600" timestamp=1720308195 total_threads=112
Jul 06 19:23:15 ubuntux ollama[584320]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="111" port="43475" tid="140507113369600" timestamp=1720308195
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: loaded meta data with 39 key-value pairs and 959 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408 (version GGUF V3 (latest))
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 0: general.architecture str = deepseek2
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 1: general.name str = DeepSeek-Coder-V2-Instruct
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 2: deepseek2.block_count u32 = 60
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 3: deepseek2.context_length u32 = 163840
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 4: deepseek2.embedding_length u32 = 5120
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 5: deepseek2.feed_forward_length u32 = 12288
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 6: deepseek2.attention.head_count u32 = 128
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 7: deepseek2.attention.head_count_kv u32 = 128
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 8: deepseek2.rope.freq_base f32 = 10000.000000
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 9: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 10: deepseek2.expert_used_count u32 = 6
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 11: general.file_type u32 = 2
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 12: deepseek2.leading_dense_block_count u32 = 1
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 13: deepseek2.vocab_size u32 = 102400
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 14: deepseek2.attention.q_lora_rank u32 = 1536
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 15: deepseek2.attention.kv_lora_rank u32 = 512
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 16: deepseek2.attention.key_length u32 = 192
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 17: deepseek2.attention.value_length u32 = 128
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 18: deepseek2.expert_feed_forward_length u32 = 1536
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 19: deepseek2.expert_count u32 = 160
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 20: deepseek2.expert_shared_count u32 = 2
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 21: deepseek2.expert_weights_scale f32 = 16.000000
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 22: deepseek2.rope.dimension_count u32 = 64
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 23: deepseek2.rope.scaling.type str = yarn
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 24: deepseek2.rope.scaling.factor f32 = 40.000000
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 25: deepseek2.rope.scaling.original_context_length u32 = 4096
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 26: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 28: tokenizer.ggml.pre str = deepseek-llm
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,102400] = ["!", "\"", "#", "$", "%", "&", "'", ...
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,102400] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,99757] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 100000
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 100001
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 100001
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 37: tokenizer.chat_template str = {% if not add_generation_prompt is de...
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - kv 38: general.quantization_version u32 = 2
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - type f32: 300 tensors
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - type q4_0: 658 tensors
Jul 06 19:23:15 ubuntux ollama[169742]: llama_model_loader: - type q6_K: 1 tensors
Jul 06 19:23:15 ubuntux ollama[169742]: time=2024-07-06T19:23:15.536-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model"
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_vocab: special tokens cache size = 2400
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_vocab: token to piece cache size = 0.6661 MB
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: format = GGUF V3 (latest)
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: arch = deepseek2
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: vocab type = BPE
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_vocab = 102400
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_merges = 99757
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_ctx_train = 163840
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_embd = 5120
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_head = 128
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_head_kv = 128
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_layer = 60
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_rot = 64
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_embd_head_k = 192
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_embd_head_v = 128
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_gqa = 1
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_embd_k_gqa = 24576
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_embd_v_gqa = 16384
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: f_norm_eps = 0.0e+00
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: f_logit_scale = 0.0e+00
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_ff = 12288
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_expert = 160
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_expert_used = 6
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: causal attn = 1
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: pooling type = 0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: rope type = 0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: rope scaling = yarn
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: freq_base_train = 10000.0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: freq_scale_train = 0.025
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_ctx_orig_yarn = 4096
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: rope_finetuned = unknown
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: ssm_d_conv = 0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: ssm_d_inner = 0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: ssm_d_state = 0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: ssm_dt_rank = 0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: model type = 236B
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: model ftype = Q4_0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: model params = 235.74 B
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: model size = 123.78 GiB (4.51 BPW)
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: general.name = DeepSeek-Coder-V2-Instruct
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: BOS token = 100000 '<|begin▁of▁sentence|>'
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: EOS token = 100001 '<|end▁of▁sentence|>'
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: PAD token = 100001 '<|end▁of▁sentence|>'
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: LF token = 126 'Ä'
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_layer_dense_lead = 1
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_lora_q = 1536
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_lora_kv = 512
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_ff_exp = 1536
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: n_expert_shared = 2
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: expert_weights_scale = 16.0
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_print_meta: rope_yarn_log_mul = 0.1000
Jul 06 19:23:15 ubuntux ollama[169742]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
Jul 06 19:23:15 ubuntux ollama[169742]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
Jul 06 19:23:15 ubuntux ollama[169742]: ggml_cuda_init: found 1 CUDA devices:
Jul 06 19:23:15 ubuntux ollama[169742]: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Jul 06 19:23:15 ubuntux ollama[169742]: llm_load_tensors: ggml ctx size = 0.87 MiB
Jul 06 19:23:16 ubuntux ollama[169742]: time=2024-07-06T19:23:16.992-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server not responding"
Jul 06 19:23:19 ubuntux ollama[169742]: time=2024-07-06T19:23:19.191-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model"
Jul 06 19:23:19 ubuntux ollama[169742]: llm_load_tensors: offloading 10 repeating layers to GPU
Jul 06 19:23:19 ubuntux ollama[169742]: llm_load_tensors: offloaded 10/61 layers to GPU
Jul 06 19:23:19 ubuntux ollama[169742]: llm_load_tensors: CPU buffer size = 105416.00 MiB
Jul 06 19:23:19 ubuntux ollama[169742]: llm_load_tensors: CUDA0 buffer size = 21335.35 MiB
Jul 06 19:23:21 ubuntux ollama[169742]: llama_new_context_with_model: n_ctx = 2048
Jul 06 19:23:21 ubuntux ollama[169742]: llama_new_context_with_model: n_batch = 512
Jul 06 19:23:21 ubuntux ollama[169742]: llama_new_context_with_model: n_ubatch = 512
Jul 06 19:23:21 ubuntux ollama[169742]: llama_new_context_with_model: flash_attn = 0
Jul 06 19:23:21 ubuntux ollama[169742]: llama_new_context_with_model: freq_base = 10000.0
Jul 06 19:23:21 ubuntux ollama[169742]: llama_new_context_with_model: freq_scale = 0.025
Jul 06 19:23:21 ubuntux ollama[169742]: time=2024-07-06T19:23:21.904-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server not responding"
Jul 06 19:23:23 ubuntux ollama[169742]: time=2024-07-06T19:23:23.601-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model"
Jul 06 19:23:24 ubuntux ollama[169742]: llama_kv_cache_init: CUDA_Host KV buffer size = 8000.00 MiB
Jul 06 19:23:24 ubuntux ollama[169742]: llama_kv_cache_init: CUDA0 KV buffer size = 1600.00 MiB
Jul 06 19:23:24 ubuntux ollama[169742]: llama_new_context_with_model: KV self size = 9600.00 MiB, K (f16): 5760.00 MiB, V (f16): 3840.00 MiB
Jul 06 19:23:24 ubuntux ollama[169742]: llama_new_context_with_model: CUDA_Host output buffer size = 0.41 MiB
Jul 06 19:23:24 ubuntux ollama[169742]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 842.00 MiB on device 0: cudaMalloc failed: out of memory
Jul 06 19:23:24 ubuntux ollama[169742]: ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 882903040
Jul 06 19:23:24 ubuntux ollama[169742]: llama_new_context_with_model: failed to allocate compute buffers
Jul 06 19:23:25 ubuntux ollama[169742]: llama_init_from_gpt_params: error: failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408'
Jul 06 19:23:26 ubuntux ollama[169742]: time=2024-07-06T19:23:26.314-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server not responding"
Jul 06 19:23:26 ubuntux ollama[584320]: ERROR [load_model] unable to load model | model="/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408" tid="140507113369600" timestamp=1720308206
Jul 06 19:23:26 ubuntux ollama[169742]: terminate called without an active exception
Jul 06 19:23:26 ubuntux ollama[169742]: time=2024-07-06T19:23:26.566-04:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error"
Jul 06 19:23:26 ubuntux ollama[169742]: time=2024-07-06T19:23:26.817-04:00 level=ERROR source=sched.go:388 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408'"
Jul 06 19:23:26 ubuntux ollama[169742]: [GIN] 2024/07/06 - 19:23:26 | 500 | 11.766669885s | 127.0.0.1 | POST "/api/chat"
Jul 06 19:23:31 ubuntux ollama[169742]: time=2024-07-06T19:23:31.944-04:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.127181114 model=/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408
Jul 06 19:23:32 ubuntux ollama[169742]: time=2024-07-06T19:23:32.194-04:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.376988446 model=/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408
Jul 06 19:23:32 ubuntux ollama[169742]: time=2024-07-06T19:23:32.444-04:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.626674402 model=/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408
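A rough back-of-envelope sum of the CUDA0 allocations reported above (not the scheduler's own accounting) shows how tight the margin is:

```
# CUDA0 weights buffer   21335.35 MiB
# CUDA0 KV buffer         1600.00 MiB
# compute buffer           842.00 MiB  <- the allocation that failed
echo $(( 21335 + 1600 + 842 ))   # 23777 MiB, about 23.2 GiB
# vs. the 23.3 GiB reported available -- essentially nothing left for the
# CUDA context itself (typically a few hundred MiB), hence the cudaMalloc OOM.
```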

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.48

GiteaMirror added the bug label 2026-05-03 21:27:28 -05:00

@olumolu commented on GitHub (Jul 8, 2024):

I have tried the 16b on Alma Linux with a Xeon processor, one motherboard, and 16 GB of main memory; I could run deepseek-v2 16b nicely.


@scouzi1966 commented on GitHub (Jul 8, 2024):

> I have tried the 16b on Alma Linux with a Xeon processor, one motherboard, and 16 GB of main memory; I could run deepseek-v2 16b nicely.

My issue is with the 236b model. Quite a large difference from the 16b.


@Ramzee-S commented on GitHub (Jul 13, 2024):

Sorry, not of much help, but I have a similar issue. When I disable my GPUs (2x RTX 3090), I can run the model in main memory (512 GB, 16 channels x 32 GB) at a tolerable speed (2x Xeon 8470), although initial prompt processing takes a while. However, when the GPUs are enabled, I get the error:
ollama run deepseek-coder-v2:236b
Error: llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408'
Other models that won't fit in VRAM run fine, partially in VRAM and partially in RAM, but this one does not seem to. Ollama version is 0.1.47, with enough local disk space too.
Any help would be appreciated.


@scouzi1966 commented on GitHub (Jul 15, 2024):

> Sorry, not of much help, but I have a similar issue. When I disable my GPUs (2x RTX 3090), I can run the model in main memory (512 GB, 16 channels x 32 GB) at a tolerable speed (2x Xeon 8470), although initial prompt processing takes a while. However, when the GPUs are enabled, I get the error:
> ollama run deepseek-coder-v2:236b
> Error: llama runner process has terminated: signal: aborted (core dumped) error:failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-6bbfda8eb96dadd0300076196110f78ff709829c3be9778e86948b839cf05408'
> Other models that won't fit in VRAM run fine, partially in VRAM and partially in RAM, but this one does not seem to. Ollama version is 0.1.47, with enough local disk space too.
> Any help would be appreciated.

How do you get Ollama to ignore your GPU? Or how do you disable it on Linux?
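(For later readers: one commonly documented approach, per the Ollama GPU docs, is to hide the CUDA devices from the server with an invalid GPU ID; behavior may vary by version, so treat this as a sketch.)

```
# Stop the system service, then start a server that cannot see any GPU:
sudo systemctl stop ollama
CUDA_VISIBLE_DEVICES=-1 ollama serve
```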


@Ramzee-S commented on GitHub (Jul 15, 2024):

> How do you get Ollama to ignore your GPU? Or how do you disable it on Linux?

By accident my Ubuntu Nvidia driver got updated a few days ago, and I needed a reboot to get the Nvidia drivers working again. When I ran nvidia-smi, nothing showed up except a message that versions did not match.
Then I tried the deepseek-coder-v2 model and it ran! It was actually quite good. After the reboot it stopped working. I replicated this by removing the physical GPUs, and then the model works; putting the GPUs back in, it of course did not work again. I have not yet found a way to disable the GPUs temporarily via environment variables, so I have no easy fix and am using other models until maybe a new version of ollama will run this model.

Edit: OK, found a fix here: https://unix.stackexchange.com/questions/654075/how-can-i-disable-and-later-re-enable-one-of-my-nvidia-gpus
You can disable the GPUs on Linux (works on Ubuntu 22). Determine the device ID using nvidia-smi, or better:

```
lspci | grep NVIDIA
```

Then, to deactivate, use the device ID in place of the '0000:xx:00.0' string in the following:

```
nvidia-smi -i 0000:xx:00.0 -pm 0
nvidia-smi drain -p 0000:xx:00.0 -m 1
```

To activate, use:

```
nvidia-smi drain -p 0000:xx:00.0 -m 0
```

again replacing the ID.
Now deepseek is working again for me. But of course partial GPU offloading would be nicer, because, just as an example:
total duration: 3m24.609638164s
load duration: 46.016846ms
prompt eval count: 424 token(s)
prompt eval duration: 26.198928s
prompt eval rate: 16.18 tokens/s
eval count: 525 token(s)
eval duration: 2m58.207292s
eval rate: 2.95 tokens/s


@dhiltgen commented on GitHub (Jul 24, 2024):

> Trying to achieve a way to disable the GPU temporarily by some environment variables

You can use OLLAMA_LLM_LIBRARY to force a CPU-based runner (e.g. OLLAMA_LLM_LIBRARY=cpu_avx2).

I've also posted PR #5922 to add a new GPU overhead setting, to bring back a viable workaround for when the memory predictions are incorrect.
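A hedged sketch of making such settings persistent on a standard Linux install, where ollama runs as a systemd service (the variable name is as documented; cpu_avx2 is the example value from the comment above):

```
# Open a systemd override for the service:
sudo systemctl edit ollama.service
# Add under [Service]:
#   Environment="OLLAMA_LLM_LIBRARY=cpu_avx2"
# Then reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
```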


@gsoul commented on GitHub (Aug 27, 2024):

I'm experiencing the same issue as the topic starter. Has anyone figured out a workaround yet?


@pftg commented on GitHub (Sep 10, 2024):

The same issue occurs on macOS.


@gsoul commented on GitHub (Sep 11, 2024):

It's fixed for me now. @pftg, try upgrading to the latest version and playing around with the OLLAMA_GPU_OVERHEAD env parameter: https://github.com/ollama/ollama/pull/5922
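A hedged usage sketch, assuming a build that includes PR #5922 (the variable takes a per-GPU reservation in bytes; 2 GiB here is just an example value):

```
# Reserve extra VRAM headroom before starting the server:
sudo systemctl stop ollama
OLLAMA_GPU_OVERHEAD=$((2 * 1024 * 1024 * 1024)) ollama serve
```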


@pftg commented on GitHub (Sep 11, 2024):

@gsoul I tried OLLAMA_GPU_OVERHEAD for deepseek-v2:236b and still had no success. I'm using the default version for now; I believe my laptop has insufficient memory for the larger version.


@olumolu commented on GitHub (Sep 11, 2024):

> @gsoul I tried OLLAMA_GPU_OVERHEAD for deepseek-v2:236b and still had no success. I'm using the default version for now; I believe my laptop has insufficient memory for the larger version.

How much RAM do you have?


@pftg commented on GitHub (Sep 11, 2024):

@olumolu 16GB


@olumolu commented on GitHub (Sep 11, 2024):

> @olumolu 16GB

No, with 16 GB you can't even run gemma 27b.
128 GB of RAM minimum (on Arch without a desktop environment, with zswap, or with VRAM) can barely run that model; 160-196 GB of RAM is recommended to run it.


@Ramzee-S commented on GitHub (Sep 11, 2024):

I am quite sure that with 16 GB of RAM or VRAM you won't be able to run deepseek 236b in any productive way (the model size in RAM is 133 GB, and even a big, fast SSD swap will be too slow). The issues above also occurred with 512 GB of RAM and 2x RTX 3090 (48 GB VRAM total), and for users with 196 GB of RAM.
The issues above also seem related to this thread on llama.cpp:
https://github.com/ggerganov/llama.cpp/discussions/8520
Running deepseek 236b in llama.cpp directly also gave issues, and I think these were related to some of the ollama issues people are experiencing. When running llama.cpp, by default a fixed fraction of the model size was used as a context size multiplier, which resulted in very high memory allocation when loading/starting the model and failed in some cases. Basically, you could have enough memory for the model but not for the default allocated context size. If a smaller context window was manually allocated, things worked with llama.cpp. I would not be surprised if lower default context settings in ollama fixed some of these issues. However, the other symptom, things working without the GPU but not with it, seems to be a separate issue.
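In Ollama terms, the corresponding mitigation would be to cap the context window so the KV cache and compute buffers shrink; a hedged sketch using the standard num_ctx option (the value 1024 is just an example, halving the 2048 default seen in the log above):

```
# Interactively:
#   ollama run deepseek-coder-v2:236b
#   >>> /set parameter num_ctx 1024

# Or per-request via the API:
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder-v2:236b",
  "prompt": "hello",
  "options": { "num_ctx": 1024 }
}'
```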

Reference: github-starred/ollama#65486