[GH-ISSUE #14429] 7B models not running on v0.17.0 Error: llama runner process has terminated: exit status 2 #71428

Closed
opened 2026-05-05 01:38:11 -05:00 by GiteaMirror · 9 comments
Owner

Originally created by @kaifhm on GitHub (Feb 26, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14429

What is the issue?

I upgraded from v0.6.6 to v0.17.0 and my 7B models can no longer be loaded. I get `Error: llama runner process has terminated: exit status 2`, similar to #8770. I don't think I should have to tinker with parameters to make this work, since the previous version clearly didn't need that.
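
For what it's worth, the log below shows Flash Attention being switched from auto to enabled on this compute capability 5.0 GPU immediately before `graph_reserve: failed to allocate compute buffers` and the segfault, so disabling flash attention seems worth trying as an isolation test. A minimal sketch for a systemd install, assuming the documented `OLLAMA_FLASH_ATTENTION` environment variable also force-disables the newer auto mode (not verified on v0.17.0):

```shell
# Assumption: OLLAMA_FLASH_ATTENTION=0 overrides FlashAttention:Auto on v0.17.0.
# Add the override to the service unit and restart:
sudo systemctl edit ollama.service
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=0"
sudo systemctl restart ollama
# Retry the model (substitute the exact tag that was pulled):
ollama run qwen2.5:7b-instruct
```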

Relevant log output

time=2026-02-25T07:35:07.656+05:30 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 34227"
llama_model_loader: loaded meta data with 34 key-value pairs and 339 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-2bada8a7450677000f678be90653b85d364de7db25eb5ea54136ada5f3933730 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 7B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5
llama_model_loader: - kv   5:                         general.size_label str              = 7B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-7...
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2.5 7B
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-7B
llama_model_loader: - kv  12:                               general.tags arr[str,2]       = ["chat", "text-generation"]
llama_model_loader: - kv  13:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  14:                          qwen2.block_count u32              = 28
llama_model_loader: - kv  15:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  16:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv  17:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  18:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  19:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  20:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  22:                          general.file_type u32              = 15
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.36 GiB (4.91 BPW)
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 1
print_info: no_alloc         = 0
print_info: model type       = ?B
print_info: model params     = 7.62 B
print_info: general.name     = Qwen2.5 7B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2026-02-25T07:35:09.011+05:30 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --model /usr/share/ollama/.ollama/models/blobs/sha256-2bada8a7450677000f678be90653b85d364de7db25eb5ea54136ada5f3933730 --port 39315"
time=2026-02-25T07:35:09.011+05:30 level=INFO source=sched.go:491 msg="system memory" total="15.5 GiB" free="9.0 GiB" free_swap="6.1 GiB"
time=2026-02-25T07:35:09.011+05:30 level=INFO source=sched.go:498 msg="gpu memory" id=GPU-d5ef2446-caa5-56c8-7192-a20e3d31269f library=CUDA available="1.5 GiB" free="1.9 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-02-25T07:35:09.011+05:30 level=INFO source=server.go:498 msg="loading model" "model layers"=29 requested=-1
time=2026-02-25T07:35:09.011+05:30 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="568.8 MiB"
time=2026-02-25T07:35:09.011+05:30 level=INFO source=device.go:245 msg="model weights" device=CPU size="3.5 GiB"
time=2026-02-25T07:35:09.011+05:30 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="32.0 MiB"
time=2026-02-25T07:35:09.011+05:30 level=INFO source=device.go:256 msg="kv cache" device=CPU size="192.0 MiB"
time=2026-02-25T07:35:09.011+05:30 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="730.4 MiB"
time=2026-02-25T07:35:09.011+05:30 level=INFO source=device.go:272 msg="total memory" size="5.0 GiB"
time=2026-02-25T07:35:09.026+05:30 level=INFO source=runner.go:965 msg="starting go runner"
load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce MX110, compute capability 5.0, VMM: yes, ID: GPU-d5ef2446-caa5-56c8-7192-a20e3d31269f
load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
time=2026-02-25T07:35:09.285+05:30 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2026-02-25T07:35:09.287+05:30 level=INFO source=runner.go:1001 msg="Server listening on 127.0.0.1:39315"
time=2026-02-25T07:35:09.290+05:30 level=INFO source=runner.go:895 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Auto KvSize:4096 KvCacheType: NumThreads:4 GPULayers:4[ID:GPU-d5ef2446-caa5-56c8-7192-a20e3d31269f Layers:4(24..27)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:true}"
time=2026-02-25T07:35:09.291+05:30 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
time=2026-02-25T07:35:09.292+05:30 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
ggml_backend_cuda_device_get_memory device GPU-d5ef2446-caa5-56c8-7192-a20e3d31269f utilizing NVML memory reporting free: 2085683200 total: 2147483648
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce MX110) (0000:01:00.0) - 1989 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 339 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-2bada8a7450677000f678be90653b85d364de7db25eb5ea54136ada5f3933730 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 7B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5
llama_model_loader: - kv   5:                         general.size_label str              = 7B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-7...
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2.5 7B
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-7B
llama_model_loader: - kv  12:                               general.tags arr[str,2]       = ["chat", "text-generation"]
llama_model_loader: - kv  13:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  14:                          qwen2.block_count u32              = 28
llama_model_loader: - kv  15:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  16:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv  17:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  18:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  19:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  20:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  22:                          general.file_type u32              = 15
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.36 GiB (4.91 BPW)
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: no_alloc         = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 3584
print_info: n_embd_inp       = 3584
print_info: n_layer          = 28
print_info: n_head           = 28
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 7
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 18944
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_yarn_log_mul= 0.0000
print_info: rope_finetuned   = unknown
print_info: model type       = 7B
print_info: model params     = 7.62 B
print_info: general.name     = Qwen2.5 7B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 4 repeating layers to GPU
load_tensors: offloaded 4/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  4460.45 MiB
load_tensors:        CUDA0 model buffer size =   568.82 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.59 MiB
llama_kv_cache:        CPU KV buffer size =   192.00 MiB
llama_kv_cache:      CUDA0 KV buffer size =    32.00 MiB
llama_kv_cache: size =  224.00 MiB (  4096 cells,  28 layers,  1/1 seqs), K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_context: Flash Attention was auto, set to enabled
graph_reserve: failed to allocate compute buffers
SIGSEGV: segmentation violation
PC=0x74dbbb538481 m=9 sigcode=1 addr=0x74e50cf3fe78
signal arrived during cgo execution
goroutine 14 gp=0xc000102fc0 m=9 mp=0xc000680008 [syscall]:
runtime.cgocall(0x5b56d27aa380, 0xc000187c00)
        runtime/cgocall.go:167 +0x4b fp=0xc000187bd8 sp=0xc000187ba0 pc=0x5b56d186facb
github.com/ollama/ollama/llama._Cfunc_llama_init_from_model(0x74dc20000c70, {0x1000, 0x200, 0x200, 0x1, 0x4, 0x4, 0xffffffff, 0xffffffff, 0xffffffff, ...})
        _cgo_gotypes.go:767 +0x4e fp=0xc000187c00 sp=0xc000187bd8 pc=0x5b56d1d058ae
github.com/ollama/ollama/llama.NewContextWithModel.func1(...)
        github.com/ollama/ollama/llama/llama.go:322
github.com/ollama/ollama/llama.NewContextWithModel(0xc000432020, {{0x1000, 0x200, 0x200, 0x1, 0x4, 0x4, 0xffffffff, 0xffffffff, 0xffffffff, ...}})
        github.com/ollama/ollama/llama/llama.go:322 +0x158 fp=0xc000187da0 sp=0xc000187c00 pc=0x5b56d1d09b18
github.com/ollama/ollama/runner/llamarunner.(*Server).loadModel(0xc0004a8280, {{0xc0003d4810, 0x1, 0x1}, 0x4, 0x0, 0x1, {0xc0003d4808, 0x1, 0x2}, ...}, ...)
        github.com/ollama/ollama/runner/llamarunner/runner.go:847 +0x178 fp=0xc000187ee8 sp=0xc000187da0 pc=0x5b56d1dbb318
github.com/ollama/ollama/runner/llamarunner.(*Server).load.gowrap2()
        github.com/ollama/ollama/runner/llamarunner/runner.go:934 +0x114 fp=0xc000187fe0 sp=0xc000187ee8 pc=0x5b56d1dbc534
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000187fe8 sp=0xc000187fe0 pc=0x5b56d187aec1
created by github.com/ollama/ollama/runner/llamarunner.(*Server).load in goroutine 11
        github.com/ollama/ollama/runner/llamarunner/runner.go:934 +0x889
goroutine 1 gp=0xc000002380 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000131778 sp=0xc000131758 pc=0x5b56d1872f4e
runtime.netpollblock(0xc00051d7c8?, 0xd180c506?, 0x56?)
        runtime/netpoll.go:575 +0xf7 fp=0xc0001317b0 sp=0xc000131778 pc=0x5b56d18380f7
internal/poll.runtime_pollWait(0x74dc8eaa96d0, 0x72)
        runtime/netpoll.go:351 +0x85 fp=0xc0001317d0 sp=0xc0001317b0 pc=0x5b56d1872165
internal/poll.(*pollDesc).wait(0xc0004b2180?, 0x900000036?, 0x0)
        internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0001317f8 sp=0xc0001317d0 pc=0x5b56d18fa487
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc0004b2180)
        internal/poll/fd_unix.go:620 +0x295 fp=0xc0001318a0 sp=0xc0001317f8 pc=0x5b56d18ff855
net.(*netFD).accept(0xc0004b2180)
        net/fd_unix.go:172 +0x29 fp=0xc000131958 sp=0xc0001318a0 pc=0x5b56d1972d49
net.(*TCPListener).accept(0xc0004b00c0)
        net/tcpsock_posix.go:159 +0x1b fp=0xc0001319a8 sp=0xc000131958 pc=0x5b56d1988c5b
net.(*TCPListener).Accept(0xc0004b00c0)
        net/tcpsock.go:380 +0x30 fp=0xc0001319d8 sp=0xc0001319a8 pc=0x5b56d1987b10
net/http.(*onceCloseListener).Accept(0xc0004d4090?)
        <autogenerated>:1 +0x24 fp=0xc0001319f0 sp=0xc0001319d8 pc=0x5b56d1b9fac4
net/http.(*Server).Serve(0xc0004b4100, {0x5b56d30d9020, 0xc0004b00c0})
        net/http/server.go:3424 +0x30c fp=0xc000131b20 sp=0xc0001319f0 pc=0x5b56d1b7738c
github.com/ollama/ollama/runner/llamarunner.Execute({0xc000034260, 0x4, 0x4})
        github.com/ollama/ollama/runner/llamarunner/runner.go:1002 +0x8f5 fp=0xc000131cf0 sp=0xc000131b20 pc=0x5b56d1dbcef5
github.com/ollama/ollama/runner.Execute({0xc000034250?, 0x0?, 0x0?})
        github.com/ollama/ollama/runner/runner.go:25 +0x190 fp=0xc000131d30 sp=0xc000131cf0 pc=0x5b56d1f26630
github.com/ollama/ollama/cmd.NewCLI.func3(0xc000201800?, {0x5b56d2b01239?, 0x4?, 0x5b56d2b0123d?})
        github.com/ollama/ollama/cmd/cmd.go:2270 +0x45 fp=0xc000131d58 sp=0xc000131d30 pc=0x5b56d2729b65
github.com/spf13/cobra.(*Command).execute(0xc00057fb08, {0xc00055ee40, 0x4, 0x4})
        github.com/spf13/cobra@v1.7.0/command.go:940 +0x85c fp=0xc000131e78 sp=0xc000131d58 pc=0x5b56d19eccdc
github.com/spf13/cobra.(*Command).ExecuteC(0xc000558908)
        github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 fp=0xc000131f30 sp=0xc000131e78 pc=0x5b56d19ed525
github.com/spf13/cobra.(*Command).Execute(...)
        github.com/spf13/cobra@v1.7.0/command.go:992
github.com/spf13/cobra.(*Command).ExecuteContext(...)
        github.com/spf13/cobra@v1.7.0/command.go:985
main.main()
        github.com/ollama/ollama/main.go:12 +0x4d fp=0xc000131f50 sp=0xc000131f30 pc=0x5b56d272c00d
runtime.main()
        runtime/proc.go:283 +0x29d fp=0xc000131fe0 sp=0xc000131f50 pc=0x5b56d183f77d
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000131fe8 sp=0xc000131fe0 pc=0x5b56d187aec1
goroutine 2 gp=0xc000002e00 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000072fa8 sp=0xc000072f88 pc=0x5b56d1872f4e
runtime.goparkunlock(...)
        runtime/proc.go:441
runtime.forcegchelper()
        runtime/proc.go:348 +0xb8 fp=0xc000072fe0 sp=0xc000072fa8 pc=0x5b56d183fab8
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000072fe8 sp=0xc000072fe0 pc=0x5b56d187aec1
created by runtime.init.7 in goroutine 1
        runtime/proc.go:336 +0x1a
goroutine 3 gp=0xc000003340 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000073780 sp=0xc000073760 pc=0x5b56d1872f4e
runtime.goparkunlock(...)
        runtime/proc.go:441
runtime.bgsweep(0xc00007e000)
        runtime/mgcsweep.go:316 +0xdf fp=0xc0000737c8 sp=0xc000073780 pc=0x5b56d182a25f
runtime.gcenable.gowrap1()
        runtime/mgc.go:204 +0x25 fp=0xc0000737e0 sp=0xc0000737c8 pc=0x5b56d181e645
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0000737e8 sp=0xc0000737e0 pc=0x5b56d187aec1
created by runtime.gcenable in goroutine 1
        runtime/mgc.go:204 +0x66
goroutine 4 gp=0xc000003500 m=nil [GC scavenge wait]:
runtime.gopark(0x10000?, 0x5b56d2d14150?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000073f78 sp=0xc000073f58 pc=0x5b56d1872f4e
runtime.goparkunlock(...)
        runtime/proc.go:441
runtime.(*scavengerState).park(0x5b56d3afc3c0)
        runtime/mgcscavenge.go:425 +0x49 fp=0xc000073fa8 sp=0xc000073f78 pc=0x5b56d1827ca9
runtime.bgscavenge(0xc00007e000)
        runtime/mgcscavenge.go:658 +0x59 fp=0xc000073fc8 sp=0xc000073fa8 pc=0x5b56d1828239
runtime.gcenable.gowrap2()
        runtime/mgc.go:205 +0x25 fp=0xc000073fe0 sp=0xc000073fc8 pc=0x5b56d181e5e5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000073fe8 sp=0xc000073fe0 pc=0x5b56d187aec1
created by runtime.gcenable in goroutine 1
        runtime/mgc.go:205 +0xa5
goroutine 5 gp=0xc000003dc0 m=nil [finalizer wait]:
runtime.gopark(0x1b8?, 0xc000002380?, 0x1?, 0x23?, 0xc000072688?)
        runtime/proc.go:435 +0xce fp=0xc000072630 sp=0xc000072610 pc=0x5b56d1872f4e
runtime.runfinq()
        runtime/mfinal.go:196 +0x107 fp=0xc0000727e0 sp=0xc000072630 pc=0x5b56d181d607
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0000727e8 sp=0xc0000727e0 pc=0x5b56d187aec1
created by runtime.createfing in goroutine 1
        runtime/mfinal.go:166 +0x3d
goroutine 6 gp=0xc0001e08c0 m=nil [chan receive]:
runtime.gopark(0xc000231ae0?, 0xc000334030?, 0x60?, 0x47?, 0x5b56d19598a8?)
        runtime/proc.go:435 +0xce fp=0xc000074718 sp=0xc0000746f8 pc=0x5b56d1872f4e
runtime.chanrecv(0xc0000a8310, 0x0, 0x1)
        runtime/chan.go:664 +0x445 fp=0xc000074790 sp=0xc000074718 pc=0x5b56d180f0e5
runtime.chanrecv1(0x0?, 0x0?)
        runtime/chan.go:506 +0x12 fp=0xc0000747b8 sp=0xc000074790 pc=0x5b56d180ec72
runtime.unique_runtime_registerUniqueMapCleanup.func2(...)
        runtime/mgc.go:1796
runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
        runtime/mgc.go:1799 +0x2f fp=0xc0000747e0 sp=0xc0000747b8 pc=0x5b56d18217ef
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0000747e8 sp=0xc0000747e0 pc=0x5b56d187aec1
created by unique.runtime_registerUniqueMapCleanup in goroutine 1
        runtime/mgc.go:1794 +0x85
goroutine 7 gp=0xc0001e0e00 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000074f38 sp=0xc000074f18 pc=0x5b56d1872f4e
runtime.gcBgMarkWorker(0xc0000a9730)
        runtime/mgc.go:1423 +0xe9 fp=0xc000074fc8 sp=0xc000074f38 pc=0x5b56d1820b09
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc000074fe0 sp=0xc000074fc8 pc=0x5b56d18209e5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000074fe8 sp=0xc000074fe0 pc=0x5b56d187aec1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105
goroutine 8 gp=0xc0001e0fc0 m=nil [GC worker (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000075738 sp=0xc000075718 pc=0x5b56d1872f4e
runtime.gcBgMarkWorker(0xc0000a9730)
        runtime/mgc.go:1423 +0xe9 fp=0xc0000757c8 sp=0xc000075738 pc=0x5b56d1820b09
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc0000757e0 sp=0xc0000757c8 pc=0x5b56d18209e5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0000757e8 sp=0xc0000757e0 pc=0x5b56d187aec1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105
goroutine 9 gp=0xc0001e1180 m=nil [GC worker (idle)]:
runtime.gopark(0x3308d738cb5e?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000075f38 sp=0xc000075f18 pc=0x5b56d1872f4e
runtime.gcBgMarkWorker(0xc0000a9730)
        runtime/mgc.go:1423 +0xe9 fp=0xc000075fc8 sp=0xc000075f38 pc=0x5b56d1820b09
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc000075fe0 sp=0xc000075fc8 pc=0x5b56d18209e5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000075fe8 sp=0xc000075fe0 pc=0x5b56d187aec1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105
goroutine 18 gp=0xc000102380 m=nil [GC worker (idle)]:
runtime.gopark(0x3308d738fc85?, 0x3?, 0xbd?, 0x99?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00006e738 sp=0xc00006e718 pc=0x5b56d1872f4e
runtime.gcBgMarkWorker(0xc0000a9730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00006e7c8 sp=0xc00006e738 pc=0x5b56d1820b09
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00006e7e0 sp=0xc00006e7c8 pc=0x5b56d18209e5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00006e7e8 sp=0xc00006e7e0 pc=0x5b56d187aec1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105
goroutine 19 gp=0xc000102540 m=nil [GC worker (idle)]:
runtime.gopark(0x3308d738c5fb?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00006ef38 sp=0xc00006ef18 pc=0x5b56d1872f4e
runtime.gcBgMarkWorker(0xc0000a9730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00006efc8 sp=0xc00006ef38 pc=0x5b56d1820b09
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00006efe0 sp=0xc00006efc8 pc=0x5b56d18209e5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00006efe8 sp=0xc00006efe0 pc=0x5b56d187aec1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105
goroutine 20 gp=0xc000102700 m=nil [GC worker (idle)]:
runtime.gopark(0x3308d7381a46?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00006f738 sp=0xc00006f718 pc=0x5b56d1872f4e
runtime.gcBgMarkWorker(0xc0000a9730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00006f7c8 sp=0xc00006f738 pc=0x5b56d1820b09
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00006f7e0 sp=0xc00006f7c8 pc=0x5b56d18209e5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00006f7e8 sp=0xc00006f7e0 pc=0x5b56d187aec1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105
goroutine 21 gp=0xc0001028c0 m=nil [GC worker (idle)]:
runtime.gopark(0x3308d738b3f1?, 0x3?, 0xf4?, 0x52?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00006ff38 sp=0xc00006ff18 pc=0x5b56d1872f4e
runtime.gcBgMarkWorker(0xc0000a9730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00006ffc8 sp=0xc00006ff38 pc=0x5b56d1820b09
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00006ffe0 sp=0xc00006ffc8 pc=0x5b56d18209e5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00006ffe8 sp=0xc00006ffe0 pc=0x5b56d187aec1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105
goroutine 34 gp=0xc000504000 m=nil [GC worker (idle)]:
runtime.gopark(0x3308d7381039?, 0x0?, 0x0?, 0x0?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00050a738 sp=0xc00050a718 pc=0x5b56d1872f4e
runtime.gcBgMarkWorker(0xc0000a9730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00050a7c8 sp=0xc00050a738 pc=0x5b56d1820b09
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00050a7e0 sp=0xc00050a7c8 pc=0x5b56d18209e5
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00050a7e8 sp=0xc00050a7e0 pc=0x5b56d187aec1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105
goroutine 10 gp=0xc000102a80 m=nil [sync.WaitGroup.Wait]:
runtime.gopark(0x0?, 0x0?, 0x60?, 0x40?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00050c620 sp=0xc00050c600 pc=0x5b56d1872f4e
runtime.goparkunlock(...)
        runtime/proc.go:441
runtime.semacquire1(0xc0004a82a0, 0x0, 0x1, 0x0, 0x18)
        runtime/sema.go:188 +0x229 fp=0xc00050c688 sp=0xc00050c620 pc=0x5b56d1852d49
sync.runtime_SemacquireWaitGroup(0x0?)
        runtime/sema.go:110 +0x25 fp=0xc00050c6c0 sp=0xc00050c688 pc=0x5b56d1874885
sync.(*WaitGroup).Wait(0x0?)
        sync/waitgroup.go:118 +0x48 fp=0xc00050c6e8 sp=0xc00050c6c0 pc=0x5b56d1886928
github.com/ollama/ollama/runner/llamarunner.(*Server).run(0xc0004a8280, {0x5b56d30db8c0, 0xc0004b60f0})
        github.com/ollama/ollama/runner/llamarunner/runner.go:360 +0x4b fp=0xc00050c7b8 sp=0xc00050c6e8 pc=0x5b56d1db7c0b
github.com/ollama/ollama/runner/llamarunner.Execute.gowrap1()
        github.com/ollama/ollama/runner/llamarunner/runner.go:981 +0x28 fp=0xc00050c7e0 sp=0xc00050c7b8 pc=0x5b56d1dbd168
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00050c7e8 sp=0xc00050c7e0 pc=0x5b56d187aec1
created by github.com/ollama/ollama/runner/llamarunner.Execute in goroutine 1
        github.com/ollama/ollama/runner/llamarunner/runner.go:981 +0x4c5
goroutine 11 gp=0xc000102c40 m=nil [IO wait]:
runtime.gopark(0x74dc8ea61ae8?, 0xc0004b2200?, 0x70?, 0x99?, 0xb?)
        runtime/proc.go:435 +0xce fp=0xc0004d9948 sp=0xc0004d9928 pc=0x5b56d1872f4e
runtime.netpollblock(0x5b56d18967f8?, 0xd180c506?, 0x56?)
        runtime/netpoll.go:575 +0xf7 fp=0xc0004d9980 sp=0xc0004d9948 pc=0x5b56d18380f7
internal/poll.runtime_pollWait(0x74dc8eaa95b8, 0x72)
        runtime/netpoll.go:351 +0x85 fp=0xc0004d99a0 sp=0xc0004d9980 pc=0x5b56d1872165
internal/poll.(*pollDesc).wait(0xc0004b2200?, 0xc0004d6000?, 0x0)
        internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0004d99c8 sp=0xc0004d99a0 pc=0x5b56d18fa487
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0004b2200, {0xc0004d6000, 0x1000, 0x1000})
        internal/poll/fd_unix.go:165 +0x27a fp=0xc0004d9a60 sp=0xc0004d99c8 pc=0x5b56d18fb77a
net.(*netFD).Read(0xc0004b2200, {0xc0004d6000?, 0xc0004d9ad0?, 0x5b56d18fa945?})
        net/fd_posix.go:55 +0x25 fp=0xc0004d9aa8 sp=0xc0004d9a60 pc=0x5b56d1970da5
net.(*conn).Read(0xc000076200, {0xc0004d6000?, 0x0?, 0x0?})
        net/net.go:194 +0x45 fp=0xc0004d9af0 sp=0xc0004d9aa8 pc=0x5b56d197f165
net/http.(*connReader).Read(0xc0004ac540, {0xc0004d6000, 0x1000, 0x1000})
        net/http/server.go:798 +0x159 fp=0xc0004d9b40 sp=0xc0004d9af0 pc=0x5b56d1b6c239
bufio.(*Reader).fill(0xc0006902a0)
        bufio/bufio.go:113 +0x103 fp=0xc0004d9b78 sp=0xc0004d9b40 pc=0x5b56d1997223
bufio.(*Reader).Peek(0xc0006902a0, 0x4)
        bufio/bufio.go:152 +0x53 fp=0xc0004d9b98 sp=0xc0004d9b78 pc=0x5b56d1997353
net/http.(*conn).serve(0xc0004d4090, {0x5b56d30db888, 0xc0004ac450})
        net/http/server.go:2137 +0x785 fp=0xc0004d9fb8 sp=0xc0004d9b98 pc=0x5b56d1b72025
net/http.(*Server).Serve.gowrap3()
        net/http/server.go:3454 +0x28 fp=0xc0004d9fe0 sp=0xc0004d9fb8 pc=0x5b56d1b77788
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0004d9fe8 sp=0xc0004d9fe0 pc=0x5b56d187aec1
created by net/http.(*Server).Serve in goroutine 1
        net/http/server.go:3454 +0x485
rax    0x74e50cf3fe78
rbx    0x0
rcx    0x74da402e26b0
rdx    0x2
rdi    0x74dc200d0c50
rsi    0x0
rbp    0x74dc200ce980
rsp    0x74dc27ffea80
r8     0x74dc221e2
r9     0x7
r10    0x74dc221e2600
r11    0x708c9748f81c9a1c
r12    0x74dc2001a2b0
r13    0x0
r14    0x1
r15    0x74da405283b0
rip    0x74dbbb538481
rflags 0x10202
cs     0x33
fs     0x0
gs     0x0
time=2026-02-25T07:35:11.502+05:30 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server not responding"
time=2026-02-25T07:35:11.753+05:30 level=INFO source=sched.go:518 msg="Load failed" model=/usr/share/ollama/.ollama/models/blobs/sha256-2bada8a7450677000f678be90653b85d364de7db25eb5ea54136ada5f3933730 error="llama runner process has terminated: exit status 2"
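
Reading the trace: the crash happens inside the cgo call to `llama_init_from_model` ("signal arrived during cgo execution"), right after `graph_reserve: failed to allocate compute buffers`, so llama.cpp appears to segfault on the failed-allocation path instead of returning an error to the Go runner. The scheduler had budgeted roughly 1.3 GiB on the GPU (568.8 MiB weights + 32.0 MiB KV + 730.4 MiB compute graph) against 1.9 GiB reported free, so a plausible but unconfirmed reading is that enabling flash attention grew the compute graph past that estimate on this 2 GiB MX110. A CPU-only load via the standard `num_gpu` request option would isolate that (the model tag below is a placeholder):

```shell
# Isolation test: force a CPU-only load with num_gpu=0 (a standard Ollama
# request option). If this loads and generates, the failure is specific to
# the CUDA compute-buffer allocation path above.
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:7b-instruct",
  "prompt": "hello",
  "options": { "num_gpu": 0 }
}'
```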

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.17.0

internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Read(0xc0004b2200, {0xc0004d6000, 0x1000, 0x1000}) internal/poll/fd_unix.go:165 +0x27a fp=0xc0004d9a60 sp=0xc0004d99c8 pc=0x5b56d18fb77a net.(*netFD).Read(0xc0004b2200, {0xc0004d6000?, 0xc0004d9ad0?, 0x5b56d18fa945?}) net/fd_posix.go:55 +0x25 fp=0xc0004d9aa8 sp=0xc0004d9a60 pc=0x5b56d1970da5 net.(*conn).Read(0xc000076200, {0xc0004d6000?, 0x0?, 0x0?}) net/net.go:194 +0x45 fp=0xc0004d9af0 sp=0xc0004d9aa8 pc=0x5b56d197f165 net/http.(*connReader).Read(0xc0004ac540, {0xc0004d6000, 0x1000, 0x1000}) net/http/server.go:798 +0x159 fp=0xc0004d9b40 sp=0xc0004d9af0 pc=0x5b56d1b6c239 bufio.(*Reader).fill(0xc0006902a0) bufio/bufio.go:113 +0x103 fp=0xc0004d9b78 sp=0xc0004d9b40 pc=0x5b56d1997223 bufio.(*Reader).Peek(0xc0006902a0, 0x4) bufio/bufio.go:152 +0x53 fp=0xc0004d9b98 sp=0xc0004d9b78 pc=0x5b56d1997353 net/http.(*conn).serve(0xc0004d4090, {0x5b56d30db888, 0xc0004ac450}) net/http/server.go:2137 +0x785 fp=0xc0004d9fb8 sp=0xc0004d9b98 pc=0x5b56d1b72025 net/http.(*Server).Serve.gowrap3() net/http/server.go:3454 +0x28 fp=0xc0004d9fe0 sp=0xc0004d9fb8 pc=0x5b56d1b77788 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0004d9fe8 sp=0xc0004d9fe0 pc=0x5b56d187aec1 created by net/http.(*Server).Serve in goroutine 1 net/http/server.go:3454 +0x485 rax 0x74e50cf3fe78 rbx 0x0 rcx 0x74da402e26b0 rdx 0x2 rdi 0x74dc200d0c50 rsi 0x0 rbp 0x74dc200ce980 rsp 0x74dc27ffea80 r8 0x74dc221e2 r9 0x7 r10 0x74dc221e2600 r11 0x708c9748f81c9a1c r12 0x74dc2001a2b0 r13 0x0 r14 0x1 r15 0x74da405283b0 rip 0x74dbbb538481 rflags 0x10202 cs 0x33 fs 0x0 gs 0x0 time=2026-02-25T07:35:11.502+05:30 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server not responding" time=2026-02-25T07:35:11.753+05:30 level=INFO source=sched.go:518 msg="Load failed" model=/usr/share/ollama/.ollama/models/blobs/sha256-2bada8a7450677000f678be90653b85d364de7db25eb5ea54136ada5f3933730 error="llama runner process has terminated: exit status 2" ``` ### OS Linux ### GPU Nvidia ### CPU Intel ### Ollama version 0.17.0
GiteaMirror added the bug label 2026-05-05 01:38:11 -05:00

@arjunvenkatraman commented on GitHub (Mar 5, 2026):

I'm hitting the same issue on v0.17.6.

My setup: Ubuntu 24.04, NVIDIA GeForce 940MX (2GB VRAM), Intel CPU with AVX2, 12GB RAM (~7.4GB available), NVIDIA driver 580.126.09, CUDA 13.0.

Same behavior: 0.5b models work, all 3B+ models crash with exit status 2 — including `qwen2.5-coder:3b` and `llama3.2:3b`. The crash persists with `OLLAMA_NUM_GPU=0` and `CUDA_VISIBLE_DEVICES=""`.

Key finding that may help debug this: I built llama.cpp from source (b8210-2cd20b72e, -DGGML_CUDA=OFF) and ran the exact same Ollama model blob directly:

```
./build/bin/llama-cli -m /usr/share/ollama/.ollama/models/blobs/sha256-4a188102020e9c9530b687fd6400f775c45e90a0d7baafe65bd0a36963fbb7ba -ngl 0 -c 2048
```

Works perfectly at ~9.3 t/s. So the model file and hardware are fine — the bug appears to be in Ollama's runner.
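
For anyone who wants to reproduce that A/B test, here is a rough sketch of a CPU-only llama.cpp build; the clone URL, flags, and blob path are my assumptions about the setup described above, so adjust them to your machine:

```
# Build llama.cpp without CUDA, matching the -DGGML_CUDA=OFF build mentioned above.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=OFF
cmake --build build --config Release -j

# Run an Ollama GGUF blob directly; -ngl 0 keeps every layer on the CPU.
# Replace sha256-<your-blob> with the actual blob filename from your model store.
./build/bin/llama-cli -m /usr/share/ollama/.ollama/models/blobs/sha256-<your-blob> -ngl 0 -c 2048 -p "hello"
```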


@rick-github commented on GitHub (Mar 12, 2026):

@kaifhm

```
graph_reserve: failed to allocate compute buffers
SIGSEGV: segmentation violation
```

By default the model is running on the llama.cpp engine, which is sometimes inaccurate with memory estimation. Try setting `OLLAMA_NEW_ENGINE=1` in the server environment to use the new engine, which has more accurate memory management.

@arjunvenkatraman [Server logs](https://docs.ollama.com/troubleshooting) will aid in debugging.
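
For a Linux install like the OP's (the `/usr/share/ollama` paths suggest the systemd service), setting that variable typically follows the systemd recipe from the Ollama FAQ. A sketch, assuming a default package install:

```
# Add the variable to the ollama systemd service, then restart it.
sudo systemctl edit ollama.service
# In the override file that opens, add:
#   [Service]
#   Environment="OLLAMA_NEW_ENGINE=1"
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

For a one-off test without touching the service, running `OLLAMA_NEW_ENGINE=1 ollama serve` from a shell should behave the same way.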


@kaifhm commented on GitHub (Mar 13, 2026):

@rick-github that worked. Thanks!


@junkjunker commented on GitHub (Mar 19, 2026):

> @kaifhm
>
> By default the model is running on the llama.cpp engine, which is sometimes inaccurate with memory estimation. Try setting `OLLAMA_NEW_ENGINE=1` in the server environment to use the new engine, which has more accurate memory management.

@rick-github Forgive my ignorance, as I'm new to this. I've been getting this same error and am excited to find a possible fix. I'm running Windows 10; how/where do I set `OLLAMA_NEW_ENGINE=1` in the server environment?

Thank you!

@rick-github commented on GitHub (Mar 19, 2026):

https://github.com/ollama/ollama/blob/main/docs/faq.mdx#setting-environment-variables-on-windows
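
In case that link moves: the FAQ's GUI steps boil down to creating a per-user environment variable. A quick command-line equivalent (my sketch, not from the thread; afterwards quit Ollama from the system tray and relaunch it so the new environment is picked up):

```
REM Create a per-user environment variable from a Command Prompt.
setx OLLAMA_NEW_ENGINE 1
```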

@junkjunker commented on GitHub (Mar 19, 2026):

@rick-github Fantastic, thanks!


@junkjunker commented on GitHub (Mar 19, 2026):

Sadly, that didn't fix the "Error 500 Internal Server Error: llama runner process has terminated: exit status 2" for me.


@rick-github commented on GitHub (Mar 19, 2026):

Open a new issue and post [server logs](https://docs.ollama.com/troubleshooting).

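For reference, collecting those logs on a default Linux install usually looks like this (per the Ollama troubleshooting docs; adjust if you changed the service name or install location):

```
# Linux (systemd service): dump the server log to a file to attach to the issue.
journalctl -u ollama --no-pager > ollama-server.log
```

On Windows, the troubleshooting page points to the log files under `%LOCALAPPDATA%\Ollama` (e.g. `server.log`).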

@junkjunker commented on GitHub (Mar 20, 2026):

Correction: it did work. Whew. Because of the aging Quadro M1000M in my Dell T5510 laptop, I also had to install a down-rev version (12.2.2) of the CUDA Toolkit (while keeping the current video drivers). I then had to add several items to my environment variables:
Win10 Settings > Advanced System Settings > Environment Variables > User Variables

  1. Edit `Path` and add: `C:\Users\[Username]\AppData\Local\Programs\Ollama`
  2. Add `OLLAMA_MODELS`, with value = `[wherever-your-dir-is]\.ollama\models`
  3. Add `OLLAMA_NEW_ENGINE`, with value = `1`

Somewhere in there, Ollama decided that I did indeed have enough memory to run the llama models.

Thank you for your help!
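
A rough command-line equivalent of those three steps, for anyone scripting the same fix (an untested sketch; both paths below are placeholders for your own install):

```
REM 1) Put the Ollama install directory on the user PATH.
REM    Note: setx can truncate very long PATH values; the GUI editor avoids that.
setx PATH "%PATH%;C:\Users\<Username>\AppData\Local\Programs\Ollama"

REM 2) Point OLLAMA_MODELS at your model store (placeholder path).
setx OLLAMA_MODELS "D:\wherever\.ollama\models"

REM 3) Opt in to the new engine, as in the earlier example.
setx OLLAMA_NEW_ENGINE 1
```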

Reference: github-starred/ollama#71428