[GH-ISSUE #9061] Issue with CPU Runner Sapphire Rapids / Icelake crashing when using all CPU, no GPU #52410

Open
opened 2026-04-28 23:10:44 -05:00 by GiteaMirror · 2 comments

Originally created by @veratu on GitHub (Feb 13, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9061

What is the issue?

I have an interesting issue; here is the machine setup:

Intel Sapphire Rapids CPU: 60-core Xeon w9-3595X
1 TB of RAM
5 NVIDIA GPUs
Ubuntu (tried 22.04, 24.04, and 24.10; all produce the same result)

I compiled Ollama manually using CUDA Toolkit 12.2 (driver 535.x); I also tried CUDA Toolkit 12.8 (the latest, driver 570.x) with the same result.
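
For reference, roughly the build flow used (from memory, so treat the exact steps as approximate); this is the cmake-based source build that produces the per-CPU ggml backends such as libggml-cpu-sapphirerapids.so seen in the logs below:

```shell
# Approximate source build (steps reconstructed from memory and from the
# build paths in the logs, e.g. /root/ollama/build/lib/ollama/).
cd /root/ollama
cmake -B build          # configure; picks up the installed CUDA toolkit
cmake --build build     # builds the libggml-cpu-*.so variants and libggml-cuda.so
go build .              # builds the ollama binary itself
./ollama serve
```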

I'm using a Docker Open WebUI setup; the model is at its default settings (except that I tell it to use no GPU), and I just send the word "Test" to it so it responds.
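
For anyone reproducing without Open WebUI, an equivalent request straight against the API would look something like this (model name is illustrative; num_gpu: 0 disables GPU offload, matching the "no GPU" setting above):

```shell
# Hedged reproduction sketch: send a one-word prompt with GPU offload
# disabled via options.num_gpu = 0 (the number of layers offloaded to GPU).
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "deepseek-r1:671b",
  "prompt": "Test",
  "options": { "num_gpu": 0 }
}'
```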

It seems I can't run models purely on CPU with these CPU runners; I've tried Deepseek, Phi, and Llama 3.3, and all have the same problem. The model loads fine into memory, but it crashes as soon as it starts computing, and this only happens when using the Sapphire Rapids or Icelake runners.

If I use the standard Linux install, which just uses the cpu_avx or cuda_avx runners, I have no problems: even with the models fully in system RAM running CPU-only, they work fine. They're just slow, and I want to leverage my CPU's features for more performance.
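
For what it's worth, the server config dump below shows an OLLAMA_LLM_LIBRARY knob that should pin the runner library; a hedged workaround sketch (the accepted values here are assumed from the older runner naming and may differ in this build):

```shell
# Assumed workaround: force the generic AVX CPU runner instead of the
# Sapphire Rapids / Icelake backends. Value names (cpu, cpu_avx, cpu_avx2)
# are taken from the older runner scheme and may not apply verbatim here.
OLLAMA_LLM_LIBRARY=cpu_avx ./ollama serve
```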

I did the manual compile because I'm trying to use the AVX-512 and VNNI CPU features for more performance, but the runners that support them crash while the normal CPU runners work fine.
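
Those features are advertised in /proc/cpuinfo, so it's easy to confirm the silicon actually exposes them (the AMX buffer in the logs below suggests the AMX path is in play as well):

```shell
# List the relevant ISA flags on this machine; on Sapphire Rapids this
# should include avx512f, avx512_vnni, and the amx_* flags.
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E 'avx512|vnni|amx' | sort -u
```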

Logs attached.

Relevant log output

root@ai:~/ollama# ./ollama serve
2025/02/13 02:29:30 routes.go:1186: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-02-13T02:29:30.509Z level=INFO source=images.go:432 msg="total blobs: 63"
time=2025-02-13T02:29:30.509Z level=INFO source=images.go:439 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:   export GIN_MODE=release
 - using code:  gin.SetMode(gin.ReleaseMode)
[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embed                --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2025-02-13T02:29:30.510Z level=INFO source=routes.go:1237 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2025-02-13T02:29:30.510Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-02-13T02:29:31.803Z level=INFO source=types.go:130 msg="inference compute" id=GPU-f8a446db-ffc7-4e5b-93a2-13c48033be05 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4090" total="23.5 GiB" available="23.1 GiB"
time=2025-02-13T02:29:31.803Z level=INFO source=types.go:130 msg="inference compute" id=GPU-b045c718-c6a4-9686-38c3-42791b4f2de1 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
time=2025-02-13T02:29:31.803Z level=INFO source=types.go:130 msg="inference compute" id=GPU-104be121-373b-ae2b-2e98-426a6f285d28 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
time=2025-02-13T02:29:31.803Z level=INFO source=types.go:130 msg="inference compute" id=GPU-0a5760e1-89ad-208c-1363-db78838e753b library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
time=2025-02-13T02:29:31.803Z level=INFO source=types.go:130 msg="inference compute" id=GPU-c8c7a2b9-ee72-63b9-fc6d-959ce23cbc39 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
[GIN] 2025/02/13 - 02:29:40 | 200 |     168.051µs |       127.0.0.1 | GET      "/api/version"
[GIN] 2025/02/13 - 02:35:37 | 200 |    3.790529ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/02/13 - 02:35:40 | 200 |      80.493µs |       127.0.0.1 | GET      "/api/version"
[GIN] 2025/02/13 - 02:41:34 | 200 |    1.805961ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/02/13 - 02:41:34 | 200 |      134.61µs |       127.0.0.1 | GET      "/api/version"
[GIN] 2025/02/13 - 02:44:23 | 200 |    3.618393ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/02/13 - 02:44:23 | 200 |      100.04µs |       127.0.0.1 | GET      "/api/version"
time=2025-02-13T02:47:04.296Z level=INFO source=server.go:100 msg="system memory" total="1002.9 GiB" free="996.0 GiB" free_swap="8.0 GiB"
time=2025-02-13T02:47:04.298Z level=INFO source=memory.go:356 msg="offload to cpu" layers.requested=0 layers.model=62 layers.offload=0 layers.split="" memory.available="[996.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="462.4 GiB" memory.required.partial="0 B" memory.required.kv="38.1 GiB" memory.required.allocations="[985.5 MiB]" memory.weights.total="413.6 GiB" memory.weights.repeating="412.9 GiB" memory.weights.nonrepeating="725.0 MiB" memory.graph.full="2.2 GiB" memory.graph.partial="3.0 GiB"
time=2025-02-13T02:47:04.298Z level=INFO source=server.go:381 msg="starting llama server" cmd="/root/ollama/ollama runner --model /root/.ollama/models/blobs/sha256-9801e7fce27dbf3d0bfb468b7b21f1d132131a546dfc43e50518631b8b1800a9 --ctx-size 8192 --batch-size 512 --n-gpu-layers 0 --threads 60 --no-mmap --parallel 4 --port 36687"
time=2025-02-13T02:47:04.299Z level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-02-13T02:47:04.299Z level=INFO source=server.go:558 msg="waiting for llama runner to start responding"
time=2025-02-13T02:47:04.300Z level=INFO source=server.go:592 msg="waiting for server to become available" status="llm server error"
time=2025-02-13T02:47:04.324Z level=INFO source=runner.go:936 msg="starting go runner"
time=2025-02-13T02:47:04.324Z level=INFO source=runner.go:937 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(gcc)" threads=60
time=2025-02-13T02:47:04.324Z level=INFO source=runner.go:995 msg="Server listening on 127.0.0.1:36687"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 4: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
time=2025-02-13T02:47:04.552Z level=INFO source=server.go:592 msg="waiting for server to become available" status="llm server loading model"
load_backend: loaded CUDA backend from /root/ollama/build/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /root/ollama/build/lib/ollama/libggml-cpu-sapphirerapids.so
llama_load_model_from_file: using device CUDA3 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_load_model_from_file: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_load_model_from_file: using device CUDA2 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 23698 MiB free
llama_load_model_from_file: using device CUDA4 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_loader: loaded meta data with 42 key-value pairs and 1025 tensors from /root/.ollama/models/blobs/sha256-9801e7fce27dbf3d0bfb468b7b21f1d132131a546dfc43e50518631b8b1800a9 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 256x20B
llama_model_loader: - kv   3:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv   4:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   5:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv   6:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv   7:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv   8:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv   9:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  10: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  11:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  12:        deepseek2.leading_dense_block_count u32              = 3
llama_model_loader: - kv  13:                       deepseek2.vocab_size u32              = 129280
llama_model_loader: - kv  14:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  15:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  16:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  17:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  18:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  19:                     deepseek2.expert_count u32              = 256
llama_model_loader: - kv  20:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  21:             deepseek2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  22:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  23:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  24:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  25:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  26:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  27: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  28: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  29:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  30:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  31:                      tokenizer.ggml.tokens arr[str,129280]  = ["<|begin▁of▁sentence|>", "<
llama_model_loader: - kv  32:                  tokenizer.ggml.token_type arr[i32,129280]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  33:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv  34:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  35:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  36:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  37:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  38:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  39:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  40:               general.quantization_version u32              = 2
llama_model_loader: - kv  41:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q4_K:  606 tensors
llama_model_loader: - type q6_K:   58 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = deepseek2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 129280
llm_load_print_meta: n_merges         = 127741
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 163840
llm_load_print_meta: n_embd           = 7168
llm_load_print_meta: n_layer          = 61
llm_load_print_meta: n_head           = 128
llm_load_print_meta: n_head_kv        = 128
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 192
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 24576
llm_load_print_meta: n_embd_v_gqa     = 16384
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18432
llm_load_print_meta: n_expert         = 256
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 671B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 671.03 B
llm_load_print_meta: model size       = 376.65 GiB (4.82 BPW) 
llm_load_print_meta: general.name     = n/a
llm_load_print_meta: BOS token        = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: EOT token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 131 'Ä'
llm_load_print_meta: FIM PRE token    = 128801 '<|fim▁begin|>'
llm_load_print_meta: FIM SUF token    = 128800 '<|fim▁hole|>'
llm_load_print_meta: FIM MID token    = 128802 '<|fim▁end|>'
llm_load_print_meta: EOG token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead   = 3
llm_load_print_meta: n_lora_q             = 1536
llm_load_print_meta: n_lora_kv            = 512
llm_load_print_meta: n_ff_exp             = 2048
llm_load_print_meta: n_expert_shared      = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm  = 1
llm_load_print_meta: expert_gating_func   = sigmoid
llm_load_print_meta: rope_yarn_log_mul    = 0.1000
time=2025-02-13T02:47:24.364Z level=INFO source=server.go:592 msg="waiting for server to become available" status="llm server not responding"
time=2025-02-13T02:48:29.758Z level=INFO source=server.go:592 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_host_malloc: failed to allocate 376141.90 MiB of pinned memory: invalid argument
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/62 layers to GPU
llm_load_tensors:          CPU model buffer size = 376141.90 MiB
llm_load_tensors:          AMX model buffer size =  9264.66 MiB
llm_load_tensors:          CPU model buffer size =   497.11 MiB
ggml_backend_amx_buffer_set_tensor: amx repack tensor output.weight of type q6_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.0.attn_q_a.weight of type q4_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.0.attn_q_b.weight of type q4_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.0.attn_kv_a_mqa.weight of type q4_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.0.attn_kv_b.weight of type q4_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.0.attn_output.weight of type q4_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.0.ffn_gate.weight of type q4_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.0.ffn_down.weight of type q6_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.0.ffn_up.weight of type q4_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.1.attn_q_a.weight of type q4_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.1.attn_q_b.weight of type q4_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.1.attn_kv_a_mqa.weight of type q4_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.1.attn_kv_b.weight of type q4_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.1.attn_output.weight of type q4_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.1.ffn_gate.weight of type q4_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.1.ffn_down.weight of type q6_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.1.ffn_up.weight of type q4_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.2.attn_q_a.weight of type q4_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.2.attn_q_b.weight of type q4_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.2.attn_kv_a_mqa.weight of type q4_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.2.attn_kv_b.weight of type q4_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.2.attn_output.weight of type q4_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.2.ffn_gate.weight of type q4_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.2.ffn_down.weight of type q6_K
ggml_backend_amx_buffer_set_tensor: amx repack tensor blk.2.ffn_up.weight of type q4_K
< removed the repack lines for the other 58 layers of this log >
llama_new_context_with_model: n_seq_max     = 4
llama_new_context_with_model: n_ctx         = 8192
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 10000.0
llama_new_context_with_model: freq_scale    = 0.025
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init:        CPU KV buffer size = 39040.00 MiB
llama_new_context_with_model: KV self size  = 39040.00 MiB, K (f16): 23424.00 MiB, V (f16): 15616.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     2.08 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =  5039.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    96.01 MiB
llama_new_context_with_model: graph nodes  = 5025
llama_new_context_with_model: graph splits = 1148 (with bs=512), 1 (with bs=1)
time=2025-02-13T02:52:07.997Z level=INFO source=server.go:597 msg="llama runner started in 303.70 seconds"
SIGSEGV: segmentation violation
PC=0x0 m=12 sigcode=1 addr=0x0
signal arrived during cgo execution

goroutine 76 gp=0xc000505880 m=12 mp=0xc00057ce08 [syscall]:
runtime.cgocall(0x5e2753f59d20, 0xc0001e5ba0)
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/runtime/cgocall.go:167 +0x4b fp=0xc0001e5b78 sp=0xc0001e5b40 pc=0x5e27533b248b
github.com/ollama/ollama/llama._Cfunc_llama_decode(0x77e9c41f2870, {0x167, 0x77e9c41d0910, 0x0, 0x0, 0x77e9c4020120, 0x77e9c4022130, 0x77e9c48587c0, 0x77e9c41e8e40})
        _cgo_gotypes.go:553 +0x4f fp=0xc0001e5ba0 sp=0xc0001e5b78 pc=0x5e275376744f
github.com/ollama/ollama/llama.(*Context).Decode.func1(0x5e275377644b?, 0x77e9c41f2870?)
        /root/ollama/llama/llama.go:163 +0xf5 fp=0xc0001e5c90 sp=0xc0001e5ba0 pc=0x5e275376a175
github.com/ollama/ollama/llama.(*Context).Decode(0xc0001e5d78?, 0x0?)
        /root/ollama/llama/llama.go:163 +0x13 fp=0xc0001e5cd8 sp=0xc0001e5c90 pc=0x5e2753769ff3
github.com/ollama/ollama/llama/runner.(*Server).processBatch(0xc000175560, 0xc000692480, 0xc0001e5f20)
        /root/ollama/llama/runner/runner.go:434 +0x23f fp=0xc0001e5ee0 sp=0xc0001e5cd8 pc=0x5e275377523f
github.com/ollama/ollama/llama/runner.(*Server).run(0xc000175560, {0x5e2754552100, 0xc00054d540})
        /root/ollama/llama/runner/runner.go:342 +0x1d5 fp=0xc0001e5fb8 sp=0xc0001e5ee0 pc=0x5e2753774c75
github.com/ollama/ollama/llama/runner.Execute.gowrap2()
        /root/ollama/llama/runner/runner.go:975 +0x28 fp=0xc0001e5fe0 sp=0xc0001e5fb8 pc=0x5e2753779b68
runtime.goexit({})
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0001e5fe8 sp=0xc0001e5fe0 pc=0x5e27533c0f61
created by github.com/ollama/ollama/llama/runner.Execute in goroutine 1
        /root/ollama/llama/runner/runner.go:975 +0xde5

goroutine 1 gp=0xc0000061c0 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/runtime/proc.go:424 +0xce fp=0xc0006f35e8 sp=0xc0006f35c8 pc=0x5e27533b8b8e
runtime.netpollblock(0xc0006f3638?, 0x5334f9a6?, 0x27?)
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/runtime/netpoll.go:575 +0xf7 fp=0xc0006f3620 sp=0xc0006f35e8 pc=0x5e275337c7f7
internal/poll.runtime_pollWait(0x77eb35ec6680, 0x72)
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/runtime/netpoll.go:351 +0x85 fp=0xc0006f3640 sp=0xc0006f3620 pc=0x5e27533b7e85
internal/poll.(*pollDesc).wait(0xc000747e80?, 0x900000036?, 0x0)
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0006f3668 sp=0xc0006f3640 pc=0x5e275343f647
internal/poll.(*pollDesc).waitRead(...)
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc000747e80)
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/internal/poll/fd_unix.go:620 +0x295 fp=0xc0006f3710 sp=0xc0006f3668 pc=0x5e2753444a15
net.(*netFD).accept(0xc000747e80)
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/net/fd_unix.go:172 +0x29 fp=0xc0006f37c8 sp=0xc0006f3710 pc=0x5e27534adb09
net.(*TCPListener).accept(0xc0005498c0)
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/net/tcpsock_posix.go:159 +0x1e fp=0xc0006f3818 sp=0xc0006f37c8 pc=0x5e27534c377e
net.(*TCPListener).Accept(0xc0005498c0)
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/net/tcpsock.go:372 +0x30 fp=0xc0006f3848 sp=0xc0006f3818 pc=0x5e27534c2630
net/http.(*onceCloseListener).Accept(0xc0001c8990?)
        <autogenerated>:1 +0x24 fp=0xc0006f3860 sp=0xc0006f3848 pc=0x5e275370c8a4
net/http.(*Server).Serve(0xc0004b9680, {0x5e275454fdb0, 0xc0005498c0})
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/net/http/server.go:3330 +0x30c fp=0xc0006f3990 sp=0xc0006f3860 pc=0x5e27536e482c
github.com/ollama/ollama/llama/runner.Execute({0xc000036130?, 0x0?, 0x0?})
        /root/ollama/llama/runner/runner.go:996 +0x11a9 fp=0xc0006f3d30 sp=0xc0006f3990 pc=0x5e2753779849
github.com/ollama/ollama/cmd.NewCLI.func2(0xc000241400?, {0x5e275412215a?, 0x4?, 0x5e275412215e?})
        /root/ollama/cmd/cmd.go:1277 +0x45 fp=0xc0006f3d58 sp=0xc0006f3d30 pc=0x5e2753f59185
github.com/spf13/cobra.(*Command).execute(0xc0004f1b08, {0xc0004b8ff0, 0xf, 0xf})
        /root/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:940 +0x862 fp=0xc0006f3e78 sp=0xc0006f3d58 pc=0x5e2753526842
github.com/spf13/cobra.(*Command).ExecuteC(0xc0004bb508)
        /root/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 fp=0xc0006f3f30 sp=0xc0006f3e78 pc=0x5e2753527085
github.com/spf13/cobra.(*Command).Execute(...)
        /root/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:992
github.com/spf13/cobra.(*Command).ExecuteContext(...)
        /root/go/pkg/mod/github.com/spf13/cobra@v1.7.0/command.go:985
main.main()
        /root/ollama/main.go:12 +0x4d fp=0xc0006f3f50 sp=0xc0006f3f30 pc=0x5e2753f5950d
runtime.main()
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/runtime/proc.go:272 +0x29d fp=0xc0006f3fe0 sp=0xc0006f3f50 pc=0x5e2753383e9d
runtime.goexit({})
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0006f3fe8 sp=0xc0006f3fe0 pc=0x5e27533c0f61

goroutine 2 gp=0xc000006c40 m=nil [force gc (idle), 2 minutes]:
runtime.gopark(0x1984b111354?, 0x0?, 0x0?, 0x0?, 0x0?)
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/runtime/proc.go:424 +0xce fp=0xc0000e4fa8 sp=0xc0000e4f88 pc=0x5e27533b8b8e
runtime.goparkunlock(...)
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/runtime/proc.go:430
runtime.forcegchelper()
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/runtime/proc.go:337 +0xb8 fp=0xc0000e4fe0 sp=0xc0000e4fa8 pc=0x5e27533841d8
runtime.goexit({})
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0000e4fe8 sp=0xc0000e4fe0 pc=0x5e27533c0f61
created by runtime.init.7 in goroutine 1
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/runtime/proc.go:325 +0x1a

goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/runtime/proc.go:424 +0xce fp=0xc0000e5780 sp=0xc0000e5760 pc=0x5e27533b8b8e
runtime.goparkunlock(...)
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/runtime/proc.go:430
runtime.bgsweep(0xc000112000)
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/runtime/mgcsweep.go:317 +0xdf fp=0xc0000e57c8 sp=0xc0000e5780 pc=0x5e275336e87f
runtime.gcenable.gowrap1()
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/runtime/mgc.go:204 +0x25 fp=0xc0000e57e0 sp=0xc0000e57c8 pc=0x5e2753362ec5
runtime.goexit({})
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0000e57e8 sp=0xc0000e57e0 pc=0x5e27533c0f61
created by runtime.gcenable in goroutine 1
        /root/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.4.linux-amd64/src/runtime/mgc.go:204 +0x66

rax    0x0
rbx    0x77e9c6427b90
rcx    0x0
rdx    0x77ea41000010
rdi    0x77e9c64b9fe0
rsi    0x77e9c6427b90
rbp    0x77e9c8a00310
rsp    0x77ea56bff998
r8     0x5e8000
r9     0x0
r10    0x22
r11    0x246
r12    0x77ea41000010
r13    0x5e8000
r14    0x77e9c4353dc8
r15    0x77e9c64b9ee0
rip    0x0
rflags 0x10202
cs     0x33
fs     0x0
gs     0x0

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.5.7

GiteaMirror added the bug label 2026-04-28 23:10:44 -05:00

@veratu commented on GitHub (Feb 14, 2025):

This doesn't appear to happen with the pre-compiled 0.5.9 release that came out today. I see in the logs that it's loading the Sapphire Rapids runner, and it's not crashing, so I will use that release for now.


@veratu commented on GitHub (Feb 14, 2025):

The crashing has definitely gone away, but CPU performance has gone down instead of up. Also, large models like Deepseek 671B on CPU only often get "stuck": they won't purge out of memory after the 5m keep-alive and stay at max CPU, as if they're trying to compute even though no prompt has been sent. Even issuing an ollama stop <model> doesn't get it to shut down; I had to stop Ollama entirely. The keep-alive timer never counts down either; it just stays at the same time, like it's auto-refreshing itself and computing on thin air.
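
For context, the unload mechanisms involved look roughly like this (model name is illustrative; per the comment above, ollama stop was the one actually tried):

```shell
# Ways to inspect and force-unload a loaded model. ollama ps shows loaded
# runners and their keep-alive expiry; ollama stop requests an immediate
# unload; a generate call with keep_alive: 0 and no prompt asks the server
# to unload the model as well.
ollama ps
ollama stop deepseek-r1:671b
curl http://127.0.0.1:11434/api/generate -d '{"model": "deepseek-r1:671b", "keep_alive": 0}'
```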

Reference: github-starred/ollama#52410