[GH-ISSUE #6595] 4 AMD GPUs with mixed VRAM sizes: incorrect layer predictions lead to runner crash #66190

Closed
opened 2026-05-04 00:31:46 -05:00 by GiteaMirror · 43 comments

Originally created by @MikeLP on GitHub (Sep 2, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6595

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

When I load a large model that doesn't fit in VRAM, Ollama crashes:

➜ ~ ollama run dbrx:132b-instruct-q8_0
Error: llama runner process has terminated: signal: segmentation fault (core dumped)

This issue does not occur with Ollama 0.3.6.

My hardware:
CPU: AMD Ryzen Threadripper PRO 7965WX 24-Cores
GPU 1: AMD Instinct MI100 [Discrete]
GPU 2: AMD Instinct MI100 [Discrete]
GPU 3: AMD Radeon RX 6900 XT [Discrete]
GPU 4: AMD Radeon VII [Discrete]
VRAM: 96GiB
RAM: 128 GiB
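
On a mixed AMD setup like this, the per-GPU VRAM totals can be double-checked with rocm-smi; a minimal sketch, assuming the ROCm utilities are installed:

rocm-smi --showmeminfo vram    # total/used VRAM for each ROCm-visible GPU
rocm-smi --showproductname     # map device indices to card names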

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.3.7 - 0.3.9

GiteaMirror added the amd, memory, bug labels 2026-05-04 00:31:48 -05:00

@jmorganca commented on GitHub (Sep 2, 2024):

Thanks for the issue!


@dhiltgen commented on GitHub (Sep 3, 2024):

Can you share your server log? (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md)
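
On a systemd-based Linux install, the server log can usually be captured with journalctl; a minimal sketch, assuming the default ollama.service unit name from the install script:

journalctl -u ollama -f                          # follow the live server log
journalctl -u ollama --no-pager > server.log     # dump the recent log to a file to attach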


@MikeLP commented on GitHub (Sep 3, 2024):

@dhiltgen

Sep 03 12:59:56 iLinux systemd[1]: Started ollama.service - Ollama Service.
Sep 03 12:59:56 iLinux ollama[1114471]: 2024/09/03 12:59:56 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
Sep 03 12:59:56 iLinux ollama[1114471]: time=2024-09-03T12:59:56.539-07:00 level=INFO source=images.go:753 msg="total blobs: 155"
Sep 03 12:59:56 iLinux ollama[1114471]: time=2024-09-03T12:59:56.541-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
Sep 03 12:59:56 iLinux ollama[1114471]: time=2024-09-03T12:59:56.542-07:00 level=INFO source=routes.go:1172 msg="Listening on 127.0.0.1:11434 (version 0.3.9)"
Sep 03 12:59:56 iLinux ollama[1114471]: time=2024-09-03T12:59:56.542-07:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama457793794/runners
Sep 03 13:00:01 iLinux ollama[1114471]: time=2024-09-03T13:00:01.601-07:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cuda_v12 rocm_v60102 cpu cpu_avx cpu_avx2 cuda_v11]"
Sep 03 13:00:01 iLinux ollama[1114471]: time=2024-09-03T13:00:01.601-07:00 level=INFO source=gpu.go:200 msg="looking for compatible GPUs"
Sep 03 13:00:01 iLinux ollama[1114471]: time=2024-09-03T13:00:01.608-07:00 level=INFO source=amd_linux.go:345 msg="amdgpu is supported" gpu=0 gpu_type=gfx908
Sep 03 13:00:01 iLinux ollama[1114471]: time=2024-09-03T13:00:01.608-07:00 level=INFO source=amd_linux.go:345 msg="amdgpu is supported" gpu=1 gpu_type=gfx908
Sep 03 13:00:01 iLinux ollama[1114471]: time=2024-09-03T13:00:01.608-07:00 level=INFO source=amd_linux.go:345 msg="amdgpu is supported" gpu=2 gpu_type=gfx1030
Sep 03 13:00:01 iLinux ollama[1114471]: time=2024-09-03T13:00:01.609-07:00 level=INFO source=amd_linux.go:345 msg="amdgpu is supported" gpu=3 gpu_type=gfx906
Sep 03 13:00:01 iLinux ollama[1114471]: time=2024-09-03T13:00:01.609-07:00 level=INFO source=types.go:107 msg="inference compute" id=0 library=rocm variant="" compute=gfx908 driver=6.8 name=1002:738c total="32.0 GiB" available="32.0 GiB"
Sep 03 13:00:01 iLinux ollama[1114471]: time=2024-09-03T13:00:01.609-07:00 level=INFO source=types.go:107 msg="inference compute" id=1 library=rocm variant="" compute=gfx908 driver=6.8 name=1002:738c total="32.0 GiB" available="32.0 GiB"
Sep 03 13:00:01 iLinux ollama[1114471]: time=2024-09-03T13:00:01.609-07:00 level=INFO source=types.go:107 msg="inference compute" id=2 library=rocm variant="" compute=gfx1030 driver=6.8 name=1002:73bf total="16.0 GiB" available="12.0 GiB"
Sep 03 13:00:01 iLinux ollama[1114471]: time=2024-09-03T13:00:01.609-07:00 level=INFO source=types.go:107 msg="inference compute" id=3 library=rocm variant="" compute=gfx906 driver=6.8 name=1002:66af total="16.0 GiB" available="16.0 GiB"
Sep 03 13:00:46 iLinux ollama[1114471]: [GIN] 2024/09/03 - 13:00:46 | 200 |       33.22µs |       127.0.0.1 | HEAD     "/"
Sep 03 13:00:46 iLinux ollama[1114471]: [GIN] 2024/09/03 - 13:00:46 | 200 |   23.079446ms |       127.0.0.1 | POST     "/api/show"
Sep 03 13:00:46 iLinux ollama[1114471]: time=2024-09-03T13:00:46.285-07:00 level=INFO source=memory.go:309 msg="offload to rocm" layers.requested=-1 layers.model=65 layers.offload=45 layers.split=17,17,4,7 memory.available="[32.0 GiB 32.0 GiB 12.0 GiB 16.0 GiB]" memory.required.full="122.9 GiB" memory.required.partial="90.1 GiB" memory.required.kv="512.0 MiB" memory.required.allocations="[31.5 GiB 31.5 GiB 11.2 GiB 15.9 GiB]" memory.weights.total="100.1 GiB" memory.weights.repeating="97.0 GiB" memory.weights.nonrepeating="3.1 GiB" memory.graph.full="2.9 GiB" memory.graph.partial="2.9 GiB"
Sep 03 13:00:46 iLinux ollama[1114471]: time=2024-09-03T13:00:46.287-07:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama457793794/runners/rocm_v60102/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-4912a1576bf1e5814a568b5bfa497aa68e010112543291e04d0ab395a2daeff8 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 45 --no-mmap --parallel 1 --tensor-split 17,17,4,7 --port 34797"
Sep 03 13:00:46 iLinux ollama[1114471]: time=2024-09-03T13:00:46.287-07:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Sep 03 13:00:46 iLinux ollama[1114471]: time=2024-09-03T13:00:46.287-07:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 03 13:00:46 iLinux ollama[1114471]: time=2024-09-03T13:00:46.287-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 03 13:00:46 iLinux ollama[1118588]: INFO [main] build info | build=1 commit="1e6f655" tid="132514998158144" timestamp=1725393646
Sep 03 13:00:46 iLinux ollama[1118588]: INFO [main] system info | n_threads=24 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="132514998158144" timestamp=1725393646 total_threads=48
Sep 03 13:00:46 iLinux ollama[1118588]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="47" port="34797" tid="132514998158144" timestamp=1725393646
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: loaded meta data with 34 key-value pairs and 642 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-4912a1576bf1e5814a568b5bfa497aa68e010112543291e04d0ab395a2daeff8 (version GGUF V3 (latest))
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv   0:                       general.architecture str              = command-r
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv   2:                               general.name str              = C4Ai Command R Plus 08 2024
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv   3:                            general.version str              = 08-2024
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv   4:                           general.basename str              = c4ai-command-r-plus
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv   5:                         general.size_label str              = 104B
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv   6:                            general.license str              = cc-by-nc-4.0
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv   7:                          general.languages arr[str,10]      = ["en", "fr", "de", "es", "it", "pt", ...
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv   8:                      command-r.block_count u32              = 64
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv   9:                   command-r.context_length u32              = 131072
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  10:                 command-r.embedding_length u32              = 12288
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  11:              command-r.feed_forward_length u32              = 33792
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  12:             command-r.attention.head_count u32              = 96
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  13:          command-r.attention.head_count_kv u32              = 8
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  14:                   command-r.rope.freq_base f32              = 8000000.000000
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  15:     command-r.attention.layer_norm_epsilon f32              = 0.000010
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  16:                          general.file_type u32              = 7
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  17:                      command-r.logit_scale f32              = 0.833333
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  18:                command-r.rope.scaling.type str              = none
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = command-r
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,256000]  = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,253333]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 5
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 255001
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = true
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  28:               tokenizer.ggml.add_eos_token bool             = false
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  29:           tokenizer.chat_template.tool_use str              = {{ bos_token }}{% if messages[0]['rol...
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  30:                tokenizer.chat_template.rag str              = {{ bos_token }}{% if messages[0]['rol...
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  31:                   tokenizer.chat_templates arr[str,2]       = ["rag", "tool_use"]
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  33:               general.quantization_version u32              = 2
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - type  f32:  193 tensors
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - type q8_0:  449 tensors
Sep 03 13:00:46 iLinux ollama[1114471]: time=2024-09-03T13:00:46.539-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_vocab: special tokens cache size = 37
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_vocab: token to piece cache size = 1.8426 MB
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: arch             = command-r
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: vocab type       = BPE
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_vocab          = 256000
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_merges         = 253333
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: vocab_only       = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_ctx_train      = 131072
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_embd           = 12288
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_layer          = 64
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_head           = 96
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_head_kv        = 8
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_rot            = 128
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_swa            = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_embd_head_k    = 128
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_embd_head_v    = 128
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_gqa            = 12
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_embd_k_gqa     = 1024
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_embd_v_gqa     = 1024
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: f_norm_eps       = 1.0e-05
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: f_logit_scale    = 8.3e-01
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_ff             = 33792
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_expert         = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_expert_used    = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: causal attn      = 1
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: pooling type     = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: rope type        = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: rope scaling     = none
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: freq_base_train  = 8000000.0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: freq_scale_train = 1
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: rope_finetuned   = unknown
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: ssm_d_conv       = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: ssm_d_inner      = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: ssm_d_state      = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: model type       = ?B
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: model ftype      = Q8_0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: model params     = 103.81 B
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: model size       = 102.73 GiB (8.50 BPW)
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: general.name     = C4Ai Command R Plus 08 2024
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: BOS token        = 5 '<BOS_TOKEN>'
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: EOS token        = 255001 '<|END_OF_TURN_TOKEN|>'
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: PAD token        = 0 '<PAD>'
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: LF token         = 136 'Ä'
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: max token length = 1024
Sep 03 13:00:47 iLinux ollama[1114471]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 03 13:00:47 iLinux ollama[1114471]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 03 13:00:47 iLinux ollama[1114471]: ggml_cuda_init: found 4 ROCm devices:
Sep 03 13:00:47 iLinux ollama[1114471]:   Device 0: AMD Instinct MI100, compute capability 9.0, VMM: no
Sep 03 13:00:47 iLinux ollama[1114471]:   Device 1: AMD Instinct MI100, compute capability 9.0, VMM: no
Sep 03 13:00:47 iLinux ollama[1114471]:   Device 2: AMD Radeon RX 6900 XT, compute capability 10.3, VMM: no
Sep 03 13:00:47 iLinux ollama[1114471]:   Device 3: AMD Radeon VII, compute capability 9.0, VMM: no
Sep 03 13:00:47 iLinux ollama[1114471]: llm_load_tensors: ggml ctx size =    1.47 MiB
Sep 03 13:00:49 iLinux ollama[1114471]: time=2024-09-03T13:00:49.245-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 03 13:00:53 iLinux ollama[1114471]: time=2024-09-03T13:00:53.569-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 03 13:00:55 iLinux ollama[1114471]: time=2024-09-03T13:00:55.022-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 03 13:00:58 iLinux ollama[1114471]: time=2024-09-03T13:00:58.957-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 03 13:00:59 iLinux ollama[1114471]: llm_load_tensors: offloading 45 repeating layers to GPU
Sep 03 13:00:59 iLinux ollama[1114471]: llm_load_tensors: offloaded 45/65 layers to GPU
Sep 03 13:00:59 iLinux ollama[1114471]: llm_load_tensors:      ROCm0 buffer size = 27095.41 MiB
Sep 03 13:00:59 iLinux ollama[1114471]: llm_load_tensors:      ROCm1 buffer size = 27095.41 MiB
Sep 03 13:00:59 iLinux ollama[1114471]: llm_load_tensors:      ROCm2 buffer size =  6375.39 MiB
Sep 03 13:00:59 iLinux ollama[1114471]: llm_load_tensors:      ROCm3 buffer size = 11156.93 MiB
Sep 03 13:00:59 iLinux ollama[1114471]: llm_load_tensors:  ROCm_Host buffer size = 36658.15 MiB
Sep 03 13:01:00 iLinux ollama[1114471]: time=2024-09-03T13:01:00.160-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 03 13:01:00 iLinux ollama[1114471]: time=2024-09-03T13:01:00.411-07:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: segmentation fault (core dumped)"
Sep 03 13:01:00 iLinux ollama[1114471]: [GIN] 2024/09/03 - 13:01:00 | 500 | 14.178486292s |       127.0.0.1 | POST     "/api/chat"
Sep 03 13:01:05 iLinux ollama[1114471]: time=2024-09-03T13:01:05.412-07:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.001075302 model=/usr/share/ollama/.ollama/models/blobs/sha256-4912a1576bf1e5814a568b5bfa497aa68e010112543291e04d0ab395a2daeff8
Sep 03 13:01:05 iLinux ollama[1114471]: time=2024-09-03T13:01:05.662-07:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.250584182 model=/usr/share/ollama/.ollama/models/blobs/sha256-4912a1576bf1e5814a568b5bfa497aa68e010112543291e04d0ab395a2daeff8
Sep 03 13:01:05 iLinux ollama[1114471]: time=2024-09-03T13:01:05.912-07:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.501113953 model=/usr/share/ollama/.ollama/models/blobs/sha256-4912a1576bf1e5814a568b5bfa497aa68e010112543291e04d0ab395a2daeff8

@vanife commented on GitHub (Sep 4, 2024):

@dhiltgen

Sep 03 12:59:56 iLinux systemd[1]: Started ollama.service - Ollama Service.
Sep 03 12:59:56 iLinux ollama[1114471]: 2024/09/03 12:59:56 routes.go:1125: INFO server config
....
....
....

msg="waiting for server to become available" status="llm server not responding"

Sep 03 13:01:00 iLinux ollama[1114471]: time=2024-09-03T13:01:00.411-07:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: segmentation fault (core dumped)"
Sep 03 13:01:00 iLinux ollama[1114471]: [GIN] 2024/09/03 - 13:01:00 | 500 | 14.178486292s | 127.0.0.1 | POST "/api/chat"
Sep 03 13:01:05 iLinux ollama[1114471]: time=2024-09-03T13:01:05.412-07:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.001075302 model=/usr/share/ollama/.ollama/models/blobs/sha256-4912a1576bf1e5814a568b5bfa497aa68e010112543291e04d0ab395a2daeff8
Sep 03 13:01:05 iLinux ollama[1114471]: time=2024-09-03T13:01:05.662-07:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.250584182 model=/usr/share/ollama/.ollama/models/blobs/sha256-4912a1576bf1e5814a568b5bfa497aa68e010112543291e04d0ab395a2daeff8
Sep 03 13:01:05 iLinux ollama[1114471]: time=2024-09-03T13:01:05.912-07:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.501113953 model=/usr/share/ollama/.ollama/models/blobs/sha256-4912a1576bf1e5814a568b5bfa497aa68e010112543291e04d0ab395a2daeff8

I have a similar problem with 6 AMD Radeon Pro VII cards for larger models.
I am wondering if this is because my RAM (not VRAM) is only 64 GB. Does it need to be larger than the VRAM requirement of the model?


@dhiltgen commented on GitHub (Sep 4, 2024):

@MikeLP as a workaround, are you able to reduce the number of layers loaded via num_gpu to get it to load, and if so, how much did we overshoot?

@vanife for large models split between GPU and CPU, yes, it can require significant system memory. Usually crashes related to host memory will report as such in the logs, but not always. If you run with OLLAMA_DEBUG=1 set on the server, it will report more information about system memory, free swap space, etc., which may help identify the cause of a crash.
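
As a rough sketch of both suggestions (the layer count 40 below is only an arbitrary illustrative value, not a measured recommendation):

# Cap the number of offloaded layers for a single request via the num_gpu option
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "dbrx:132b-instruct-q8_0",
  "prompt": "hello",
  "options": { "num_gpu": 40 }
}'

# The same parameter can be set interactively inside `ollama run` with:
#   /set parameter num_gpu 40

# Enable debug logging for a systemd-managed server by adding an override, then restart:
#   sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama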


@vanife commented on GitHub (Sep 4, 2024):

> @MikeLP as a workaround, are you able to reduce the number of layers loaded via num_gpu to get it to load, and if so, how much did we overshoot?
>
> @vanife for large models split between GPU and CPU, yes, it can require significant system memory. Usually crashes related to host memory will report as such in the logs, but not always. If you run with OLLAMA_DEBUG=1 set on the server, it will report more information about system memory, free swap space, etc., which may help identify the cause of a crash.

Thank you, @dhiltgen.
I have 96GB total VRAM (6x 16GB).

For me these do work: qwen2:72b (41 GB) and llama3.1:70b (39 GB, about 58% of the 6 GPUs). But llama3.1:70b-instruct-q5_K_M (49 GB) no longer loads, even though there is clearly enough VRAM to hold the whole model (and the 64 GB of RAM should also be sufficient, I think).

This is the output from running the following command (on Ubuntu 22.04): OLLAMA_DEBUG=1 ollama run llama3.1:70b-instruct-q5_K_M (which should require ~49 GB of VRAM out of my 96 GB available):

Sep 04 18:45:51 mypc ollama[136853]: [GIN] 2024/09/04 - 18:45:51 | 200 |      19.045µs |       127.0.0.1 | HEAD     "/"
Sep 04 18:45:51 mypc ollama[136853]: [GIN] 2024/09/04 - 18:45:51 | 200 |   11.827211ms |       127.0.0.1 | POST     "/api/show"
Sep 04 18:45:51 mypc ollama[136853]: time=2024-09-04T18:45:51.706+02:00 level=INFO source=sched.go:731 msg="new model will fit in available VRAM, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-f8f84c9d64218d440bcee215e1d298a64cb4fde44df2c0b4791482bb16152ebd library=rocm parallel=4 required="61.4 GiB"
Sep 04 18:45:51 mypc ollama[136853]: time=2024-09-04T18:45:51.706+02:00 level=INFO source=memory.go:309 msg="offload to rocm" layers.requested=-1 layers.model=81 layers.offload=81 layers.split=14,14,14,13,13,13 memory.available="[15.0 GiB 15.0 GiB 14.9 GiB 14.9 GiB 14.9 GiB 14.9 GiB]" memory.required.full="61.4 GiB" memory.required.partial="61.4 GiB" memory.required.kv="2.5 GiB" memory.required.allocations="[10.7 GiB 10.4 GiB 10.6 GiB 10.1 GiB 9.8 GiB 9.8 GiB]" memory.weights.total="47.5 GiB" memory.weights.repeating="46.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
Sep 04 18:45:51 mypc ollama[136853]: time=2024-09-04T18:45:51.708+02:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama3065803334/runners/rocm_v60102/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-f8f84c9d64218d440bcee215e1d298a64cb4fde44df2c0b4791482bb16152ebd --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 81 --no-mmap --parallel 4 --tensor-split 14,14,14,13,13,13 --port 36599"
Sep 04 18:45:51 mypc ollama[136853]: time=2024-09-04T18:45:51.708+02:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Sep 04 18:45:51 mypc ollama[136853]: time=2024-09-04T18:45:51.708+02:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 18:45:51 mypc ollama[136853]: time=2024-09-04T18:45:51.708+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 18:45:51 mypc ollama[2482468]: INFO [main] build info | build=1 commit="1e6f655" tid="131619309900608" timestamp=1725468351
Sep 04 18:45:51 mypc ollama[2482468]: INFO [main] system info | n_threads=16 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="131619309900608" timestamp=1725468351 total_threads=32
Sep 04 18:45:51 mypc ollama[2482468]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="36599" tid="131619309900608" timestamp=1725468351
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-f8f84c9d64218d440bcee215e1d298a64cb4fde44df2c0b4791482bb16152ebd (version GGUF V3 (latest))
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 70B Instruct
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv   5:                         general.size_label str              = 70B
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv   6:                            general.license str              = llama3.1
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv   9:                          llama.block_count u32              = 80
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  11:                     llama.embedding_length u32              = 8192
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 28672
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 64
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  17:                          general.file_type u32              = 17
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  28:               general.quantization_version u32              = 2
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - type  f32:  162 tensors
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - type q5_K:  481 tensors
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - type q6_K:   81 tensors
Sep 04 18:45:51 mypc ollama[136853]: time=2024-09-04T18:45:51.959+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 18:45:51 mypc ollama[136853]: llm_load_vocab: special tokens cache size = 256
Sep 04 18:45:52 mypc ollama[136853]: llm_load_vocab: token to piece cache size = 0.7999 MB
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: arch             = llama
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: vocab type       = BPE
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_vocab          = 128256
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_merges         = 280147
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: vocab_only       = 0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_ctx_train      = 131072
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_embd           = 8192
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_layer          = 80
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_head           = 64
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_head_kv        = 8
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_rot            = 128
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_swa            = 0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_embd_head_k    = 128
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_embd_head_v    = 128
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_gqa            = 8
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_embd_k_gqa     = 1024
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_embd_v_gqa     = 1024
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_ff             = 28672
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_expert         = 0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_expert_used    = 0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: causal attn      = 1
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: pooling type     = 0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: rope type        = 0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: rope scaling     = linear
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: freq_base_train  = 500000.0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: freq_scale_train = 1
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: rope_finetuned   = unknown
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: ssm_d_conv       = 0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: ssm_d_inner      = 0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: ssm_d_state      = 0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: model type       = 70B
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: model ftype      = Q5_K - Medium
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: model params     = 70.55 B
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: model size       = 46.51 GiB (5.66 BPW)
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: general.name     = Meta Llama 3.1 70B Instruct
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: LF token         = 128 'Ä'
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: max token length = 256
Sep 04 18:45:52 mypc ollama[136853]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 04 18:45:52 mypc ollama[136853]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 18:45:52 mypc ollama[136853]: ggml_cuda_init: found 6 ROCm devices:
Sep 04 18:45:52 mypc ollama[136853]:   Device 0: AMD Radeon (TM) Pro VII, compute capability 9.0, VMM: no
Sep 04 18:45:52 mypc ollama[136853]:   Device 1: AMD Radeon (TM) Pro VII, compute capability 9.0, VMM: no
Sep 04 18:45:52 mypc ollama[136853]:   Device 2: AMD Radeon (TM) Pro VII, compute capability 9.0, VMM: no
Sep 04 18:45:52 mypc ollama[136853]:   Device 3: AMD Radeon (TM) Pro VII, compute capability 9.0, VMM: no
Sep 04 18:45:52 mypc ollama[136853]:   Device 4: AMD Radeon (TM) Pro VII, compute capability 9.0, VMM: no
Sep 04 18:45:52 mypc ollama[136853]:   Device 5: AMD Radeon (TM) Pro VII, compute capability 9.0, VMM: no
Sep 04 18:45:52 mypc ollama[136853]: llm_load_tensors: ggml ctx size =    2.37 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors: offloading 80 repeating layers to GPU
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors: offloaded 81/81 layers to GPU
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm0 buffer size =  8193.82 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm1 buffer size =  8008.94 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm2 buffer size =  7978.13 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm3 buffer size =  7447.88 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm4 buffer size =  7417.07 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm5 buffer size =  7893.67 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:  ROCm_Host buffer size =   688.88 MiB
Sep 04 18:45:53 mypc ollama[136853]: time=2024-09-04T18:45:53.464+02:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: segmentation fault (core dumped)"
Sep 04 18:45:53 mypc ollama[136853]: [GIN] 2024/09/04 - 18:45:53 | 500 |  1.797167336s |       127.0.0.1 | POST     "/api/chat"
Sep 04 18:45:58 mypc ollama[136853]: time=2024-09-04T18:45:58.465+02:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.000882851 model=/usr/share/ollama/.ollama/models/blobs/sha256-f8f84c9d64218d440bcee215e1d298a64cb4fde44df2c0b4791482bb16152ebd
Sep 04 18:45:58 mypc ollama[136853]: time=2024-09-04T18:45:58.714+02:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.250372171 model=/usr/share/ollama/.ollama/models/blobs/sha256-f8f84c9d64218d440bcee215e1d298a64cb4fde44df2c0b4791482bb16152ebd
Sep 04 18:45:58 mypc ollama[136853]: time=2024-09-04T18:45:58.964+02:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.500648232 model=/usr/share/ollama/.ollama/models/blobs/sha256-f8f84c9d64218d440bcee215e1d298a64cb4fde44df2c0b4791482bb16152ebd
Author
Owner

@vanife commented on GitHub (Sep 4, 2024):

I also tried the "success" scenario with this result:
Command: OLLAMA_DEBUG=1 ollama run llama3.1:70b

rocm-smi result once the client started loading:

=============================================== ROCm System Management Interface ===============================================
========================================================= Concise Info =========================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK    MCLK     Fan     Perf    PwrCap       VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)
================================================================================================================================
0       1     0x66a1,   7068   50.0°C  20.0W     N/A, N/A, 0         860Mhz  350Mhz   34.51%  manual  144.0W       55%    0%
1       2     0x66a1,   20169  45.0°C  17.0W     N/A, N/A, 0         860Mhz  350Mhz   34.51%  manual  144.0W       58%    0%
2       3     0x66a1,   40303  41.0°C  22.0W     N/A, N/A, 0         860Mhz  350Mhz   34.51%  manual  144.0W       58%    0%
3       4     0x66a1,   63425  44.0°C  21.0W     N/A, N/A, 0         860Mhz  350Mhz   34.51%  manual  144.0W       58%    0%
4       5     0x66a1,   53400  39.0°C  20.0W     N/A, N/A, 0         860Mhz  350Mhz   34.51%  manual  144.0W       58%    0%
5       6     0x66a1,   11634  41.0°C  19.0W     N/A, N/A, 0         860Mhz  350Mhz   34.51%  manual  144.0W       55%    0%
================================================================================================================================
===================================================== End of ROCm SMI Log ======================================================

ollama ps:

NAME            ID              SIZE    PROCESSOR       UNTIL
llama3.1:70b    d729c66f84de    55 GB   100% GPU        4 minutes from now

and the result of the journalctl --since "15 minutes ago" -u ollama --no-pager is the same as the "failure" scenario until the llm_load_tensors block starts, so I will post only the remainder of the output from that point on:

failure (same as previous post) llama3.1:70b-instruct-q5_K_M:

Sep 04 18:45:52 mypc ollama[136853]: llm_load_tensors: ggml ctx size =    2.37 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors: offloading 80 repeating layers to GPU
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors: offloaded 81/81 layers to GPU
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm0 buffer size =  8193.82 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm1 buffer size =  8008.94 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm2 buffer size =  7978.13 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm3 buffer size =  7447.88 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm4 buffer size =  7417.07 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm5 buffer size =  7893.67 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:  ROCm_Host buffer size =   688.88 MiB
Sep 04 18:45:53 mypc ollama[136853]: time=2024-09-04T18:45:53.464+02:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: segmentation fault (core dumped)"
Sep 04 18:45:53 mypc ollama[136853]: [GIN] 2024/09/04 - 18:45:53 | 500 |  1.797167336s |       127.0.0.1 | POST     "/api/chat"
Sep 04 18:45:58 mypc ollama[136853]: time=2024-09-04T18:45:58.465+02:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.000882851 model=/usr/share/ollama/.ollama/models/blobs/sha256-f8f84c9d64218d440bcee215e1d298a64cb4fde44df2c0b4791482bb16152ebd
Sep 04 18:45:58 mypc ollama[136853]: time=2024-09-04T18:45:58.714+02:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.250372171 model=/usr/share/ollama/.ollama/models/blobs/sha256-f8f84c9d64218d440bcee215e1d298a64cb4fde44df2c0b4791482bb16152ebd
Sep 04 18:45:58 mypc ollama[136853]: time=2024-09-04T18:45:58.964+02:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.500648232 model=/usr/share/ollama/.ollama/models/blobs/sha256-f8f84c9d64218d440bcee215e1d298a64cb4fde44df2c0b4791482bb16152ebd

success (llama3.1:70b):

Sep 04 19:22:51 mypc ollama[136853]: llm_load_tensors: ggml ctx size =    2.37 MiB
Sep 04 19:22:53 mypc ollama[136853]: time=2024-09-04T19:22:53.474+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 04 19:22:53 mypc ollama[136853]: time=2024-09-04T19:22:53.921+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 19:22:53 mypc ollama[136853]: llm_load_tensors: offloading 80 repeating layers to GPU
Sep 04 19:22:53 mypc ollama[136853]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 04 19:22:53 mypc ollama[136853]: llm_load_tensors: offloaded 81/81 layers to GPU
Sep 04 19:22:53 mypc ollama[136853]: llm_load_tensors:      ROCm0 buffer size =  6426.88 MiB
Sep 04 19:22:53 mypc ollama[136853]: llm_load_tensors:      ROCm1 buffer size =  6426.88 MiB
Sep 04 19:22:53 mypc ollama[136853]: llm_load_tensors:      ROCm2 buffer size =  6426.88 MiB
Sep 04 19:22:53 mypc ollama[136853]: llm_load_tensors:      ROCm3 buffer size =  5967.81 MiB
Sep 04 19:22:53 mypc ollama[136853]: llm_load_tensors:      ROCm4 buffer size =  5967.81 MiB
Sep 04 19:22:53 mypc ollama[136853]: llm_load_tensors:      ROCm5 buffer size =  6330.73 MiB
Sep 04 19:22:53 mypc ollama[136853]: llm_load_tensors:        CPU buffer size =   563.62 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model: n_ctx      = 8192
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model: n_batch    = 512
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model: n_ubatch   = 512
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model: flash_attn = 0
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model: freq_base  = 500000.0
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model: freq_scale = 1
Sep 04 19:23:58 mypc ollama[136853]: llama_kv_cache_init:      ROCm0 KV buffer size =   448.00 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_kv_cache_init:      ROCm1 KV buffer size =   448.00 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_kv_cache_init:      ROCm2 KV buffer size =   448.00 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_kv_cache_init:      ROCm3 KV buffer size =   416.00 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_kv_cache_init:      ROCm4 KV buffer size =   416.00 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_kv_cache_init:      ROCm5 KV buffer size =   384.00 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model: KV self size  = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model:  ROCm_Host  output buffer size =     2.08 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model:      ROCm0 compute buffer size =  1216.01 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model:      ROCm1 compute buffer size =  1216.01 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model:      ROCm2 compute buffer size =  1216.01 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model:      ROCm3 compute buffer size =  1216.01 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model:      ROCm4 compute buffer size =  1216.01 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model:      ROCm5 compute buffer size =  1216.02 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model:  ROCm_Host compute buffer size =    80.02 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model: graph nodes  = 2566
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model: graph splits = 7
Sep 04 19:24:02 mypc ollama[2536770]: INFO [main] model loaded | tid="140583704699712" timestamp=1725470642
Sep 04 19:24:02 mypc ollama[136853]: time=2024-09-04T19:24:02.863+02:00 level=INFO source=server.go:630 msg="llama runner started in 72.35 seconds"

Any other ideas about what I could try?

Author
Owner

@dhiltgen commented on GitHub (Sep 5, 2024):

@vanife can you run the server with OLLAMA_DEBUG=1 set? (setting this in the client has no effect)

https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-configure-ollama-server
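
For reference, a minimal sketch of doing this on a systemd install (following the FAQ above) and capturing the server log afterwards; the file name is only illustrative:

```sh
# Set the variable on the service, not in the client shell:
sudo systemctl edit ollama.service
#   [Service]
#   Environment="OLLAMA_DEBUG=1"
sudo systemctl daemon-reload
sudo systemctl restart ollama
# Reproduce the crash, then save the server log:
journalctl -u ollama --no-pager > ollama-debug.log
```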

Author
Owner

@MikeLP commented on GitHub (Sep 7, 2024):

@dhiltgen I was experimenting with a manual build of llama.cpp, and the bug is definitely in llama.cpp.
Before this, I didn't have any issues running large models. So it looks like, starting from version 0.3.7, ollama uses a newer llama.cpp that has this bug.

Here is my build command

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIPBLAS=ON -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 48

I tried offloading all layers and then just one layer, but it still crashes.

Output:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 ROCm devices:
  Device 0: AMD Radeon VII, compute capability 9.0, VMM: no
  Device 1: AMD Radeon RX 6900 XT, compute capability 10.3, VMM: no
  Device 2: AMD Instinct MI100, compute capability 9.0, VMM: no
  Device 3: AMD Instinct MI100, compute capability 9.0, VMM: no
llm_load_tensors: ggml ctx size =    0.68 MiB
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/81 layers to GPU
llm_load_tensors:      ROCm0 buffer size =   518.88 MiB
llm_load_tensors:        CPU buffer size = 40543.11 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =   512.00 MiB
llama_kv_cache_init:  ROCm_Host KV buffer size = 40448.00 MiB
llama_new_context_with_model: KV self size  = 40960.00 MiB, K (f16): 20480.00 MiB, V (f16): 20480.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     0.98 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 17200.00 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate ROCm0 buffer of size 18035509504
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model '../../../Downloads/Hermes-3-Llama-3.1-70B.Q4_K_M.gguf'
 ERR [              load_model] unable to load model | tid="125643742393280" timestamp=1725676299 model="../../../Downloads/Hermes-3-Llama-3.1-70B.Q4_K_M.gguf"
 ERR [                    main] exiting due to model loading error | tid="125643742393280" timestamp=1725676299
[1]    2017050 segmentation fault (core dumped)  build/bin/llama-server -m  -ngl 1

P.S.
It fails even without offloading any layers to the GPU (-ngl 0).
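
For what it's worth, when testing llama-server by hand it can help to cap the context explicitly, since the default here ends up being the model's full 131072-token training context (a sketch; the path is taken from the run above and the values are only illustrative):

```sh
# Same server binary, but with an explicit 8K context and full offload:
build/bin/llama-server -m ../../../Downloads/Hermes-3-Llama-3.1-70B.Q4_K_M.gguf -c 8192 -ngl 81
```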

Author
Owner

@MikeLP commented on GitHub (Sep 7, 2024):

@dhiltgen I created issue https://github.com/ggerganov/llama.cpp/issues/9352
@vanife Could you please add a comment there noting that you have the same issue?

Author
Owner

@MikeLP commented on GitHub (Sep 8, 2024):

@dhiltgen Well, I think I understand the problem better now after talking with the llama.cpp folks. Before, I was fairly sure that llama.cpp manages the context size and offloads it to RAM by default when VRAM isn't enough, but it doesn't.

As I learned later, ollama normally sets a small context size by default (2048 or so), but in this case (after 0.3.6) it doesn't overwrite it. Instead, it leaves the model's large default context size in place (if the user doesn't override it) and tries to fit it into VRAM, which exhausts the available memory (leading to a segmentation fault).
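
To put numbers on it, a back-of-the-envelope sketch using the values from the llama.cpp output above (n_layer = 80, n_embd_k_gqa = n_embd_v_gqa = 1024, f16 KV cache at 2 bytes per element):

```sh
# KV cache bytes = n_layer * n_ctx * (n_embd_k_gqa + n_embd_v_gqa) * 2, shown here in MiB:
echo $(( 80 * 131072 * (1024 + 1024) * 2 / 1024 / 1024 ))   # 40960 MiB at the full 131072-token context
echo $(( 80 *   8192 * (1024 + 1024) * 2 / 1024 / 1024 ))   # 2560 MiB at the 8192-token context from the successful run
```

which matches the "KV self size = 40960.00 MiB" and "KV self size = 2560.00 MiB" lines in the logs above, so leaving the training context in place adds roughly 40 GiB of VRAM pressure on top of the weights.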

Honestly, as a developer, I don't think any good product (especially a production one) should crash with "segmentation fault" errors. But this is open source, so whatever. I don't believe llama.cpp will fix this issue, so I just closed the GitHub ticket.

It would be great if ollama, as a high-level API, could handle this error and automatically calculate the maximum context size that fits in the user's available memory (since we know the model size, quantization, and number of layers). However, I understand that this is not a one-day fix but rather a significant feature.

So maybe the only thing we can look into now is why ollama isn't overriding the default context size in some cases in version 0.3.9.

Or, if it's expected behaviour, it just needs to be mentioned in the documentation.
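
In the meantime, the context can be capped explicitly from the client side; a sketch (the model name and value are just examples):

```sh
# Interactively, inside `ollama run <model>`:
#   >>> /set parameter num_ctx 8192
# Or per request through the API:
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.1:70b-instruct-q5_K_M",
  "prompt": "hello",
  "options": { "num_ctx": 8192 }
}'
```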

Author
Owner

@vanife commented on GitHub (Sep 8, 2024):

@vanife can you run the server with OLLAMA_DEBUG=1 set? (setting this in the client has no effect)

https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-configure-ollama-server

OK. I have made the following changes to the system:

  • updated amdgpu software, installed ROCm v 6.2.0
  • reinstalled ollama (including scripts needed for AMD GPUs)
  • removed all NVIDIA GPUs, disabled the iGPU, and installed even more Radeon Pro VII cards for a total of 10 right now (some using "risers").
  • set OLLAMA_DEBUG=1 for the service.

results of rocm-smi

=========================================== ROCm System Management Interface ===========================================
===================================================== Concise Info =====================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK    MCLK    Fan    Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)
========================================================================================================================
0       1     0x66a1,   52718  41.0°C  17.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
1       2     0x66a1,   27240  35.0°C  15.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
2       3     0x66a1,   22570  31.0°C  15.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
3       4     0x66a1,   24674  35.0°C  22.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
4       5     0x66a1,   13217  32.0°C  19.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
5       6     0x66a1,   49889  30.0°C  21.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
6       7     0x66a1,   62886  33.0°C  21.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
7       8     0x66a1,   1254   33.0°C  21.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
8       9     0x66a1,   34102  32.0°C  20.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
9       10    0x66a1,   46964  33.0°C  21.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
========================================================================================================================
================================================= End of ROCm SMI Log ==================================================

Now different models fail (compared to the previous tests) and I cannot figure out why, because VRAM is clearly not the issue; RAM shouldn't be either (in case it is needed to load the models), since even smaller models fail to load.
I have 64 GB RAM and 160 GB VRAM (10x 16 GB).

The full log from trying to run llama3.1:70b-instruct-q5_K_M (it fails now as before) is below. On line 577 the segmentation fault error is generated, but I cannot see anything in the extra DEBUG output that would indicate the reason:
_llama3.1_70b-instruct-q5_K_M--error.log

However, this time llama3.1:70b also results in the same error, although it was loading fine a few days ago.

Author
Owner

@dhiltgen commented on GitHub (Sep 12, 2024):

Somehow our prediction logic isn't correctly processing the context size across the GPUs, so we're loading too many layers on some GPU (probably GPU0). @MikeLP, until we can find and fix the defect, there are a couple of workarounds that may help.
You can force OLLAMA_NUM_PARALLEL=1, which will reduce the context by 4x. You may also be able to leverage the new env var OLLAMA_GPU_OVERHEAD to preserve some additional space on the GPUs.
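
A sketch of setting these on the server side (the same systemd-override mechanism as in the FAQ linked earlier; treating the OLLAMA_GPU_OVERHEAD value as bytes is an assumption here, and the amount is only illustrative):

```sh
sudo systemctl edit ollama.service
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=1"
#   Environment="OLLAMA_GPU_OVERHEAD=1073741824"   # assumed bytes: reserve ~1 GiB per GPU
sudo systemctl daemon-reload && sudo systemctl restart ollama
```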

Author
Owner

@MikeLP commented on GitHub (Sep 15, 2024):

@dhiltgen Unfortunately OLLAMA_GPU_OVERHEAD doesn't work in my case.

Author
Owner

@vanife commented on GitHub (Sep 26, 2024):

Somehow our prediction logic isn't correctly processing the context size across the GPUs, so we're loading too many layers on some GPU (probably GPU0). @MikeLP until we can find and fix the defect, there are a couple workarounds that may help. You can force OLLAMA_NUM_PARALLEL=1 which will reduce the context by 4x. You may be able to leverage the new env var OLLAMA_GPU_OVERHEAD to preserve some additional space on the GPUs.

I have tried many models and many sizes and came to the following observation: 62-64 GB seems to be the breaking point. Here is what I mean: as soon as the model size is 62 GB or more, the model fails to run with the error in the subject (Error: llama runner process has terminated: signal: segmentation fault (core dumped)).
Also, by "model size" I mean what ollama ps shows in the SIZE column, not the SIZE from ollama ls.

I tried OLLAMA_NUM_PARALLEL=1 and llama3.1:70b-instruct-q3_K_L loaded (ls size is "37 GB", ps size is "60 GB"), whereas without that parameter it failed, as its ollama ps size was "63 GB".

This gave me the hint to try something else: I have 11 identical GPUs (Radeon Pro VII with 16 GB VRAM each).

  • When I run llama3.1:8b-instruct-fp16 (ls: 16 GB, ps: 34 GB), it works and spreads over all 11 GPUs at ~14% load each.
  • When I run llama3.1:70b-instruct-q3_K_M (ls: 34 GB, ps: 60 GB), it also works, spreading work over 11 GPUs at ~30% load.
  • When I run llama3.1:70b-instruct-q3_K_L (ls: 37 GB, ps: 63 GB), it failed, but ollama ps briefly showed 63 GB (100% GPU).
  • so here I tried OLLAMA_NUM_PARALLEL=1, and it allowed it to run
  • I then used HIP_VISIBLE_DEVICES to limit the number of GPUs, and surprisingly (to me) the ollama ps number was smaller for a smaller number of devices: 53 GB for 7 GPUs, 50 GB for 5 GPUs, and 45 GB for 3 GPUs, all of which loaded and ran without any problem (a sketch of this device-limiting setup follows the list).
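
A sketch of the device-limiting setup referenced above, assuming the server is started by hand for the test (Ollama honours HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES; the indices below are only illustrative):

```sh
# Start the server with only 3 of the GPUs visible:
HIP_VISIBLE_DEVICES=0,1,2 OLLAMA_DEBUG=1 ollama serve
# From another shell, load the model and compare the reported size:
ollama run llama3.1:70b-instruct-q3_K_L
ollama ps   # SIZE shrinks as fewer devices are visible
```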

I then repeated the same "test" on the qwen2.5:72b-instruct-q3_K_M model, where it worked with 3 to 9 GPUs (61 GB for 9 GPUs), but it failed for 10 or 11, for which ollama ps was showing 63 and 64 GB.

Therefore, I have a strong suspicion that something happens around ~64 GB.

Any ideas?

Which two very similar settings should I run and provide with the log?

Author
Owner

@MikeLP commented on GitHub (Sep 26, 2024):

@dhiltgen This info should help.


@vanife commented on GitHub (Sep 27, 2024):

> Any ideas?
> Which two very similar settings should I run and provide with the log?

Please find attached ollama server log output:
[ollama-6595-cast2.log](https://github.com/user-attachments/files/17160864/ollama-6595-cast2.log)
and also a corresponding tmux session .gif file:
[ollama-6595-cast2-lossy](https://github.com/user-attachments/assets/e9179854-c847-4456-96c3-6821c31f082a)


@vanife commented on GitHub (Sep 27, 2024):

... happy to have a Zoom call if I can be of further help ...


@MikeLP commented on GitHub (Oct 21, 2024):

@dhiltgen Latest update 0.3.14 fixed the crash I had for large models. Should I close the issue?
@vanife Does it work in your case?


@vanife commented on GitHub (Oct 21, 2024):

> @dhiltgen Latest update 0.3.14 fixed the crash I had for large models. Should I close the issue? @vanife Does it work in your case?

Give me a day to check.
@MikeLP, do you know how exactly it was fixed? What was the culprit?


@MikeLP commented on GitHub (Oct 21, 2024):

@vanife
The changelog says Fix crashes for AMD GPUs with small system memory, which doesn't make sense for my system - 128 GB ECC DDR5 RAM and 96 GB VRAM are not small system memory - but after the update, when I ran llama3.1:405b-instruct-q2_K, it finally worked again.


@vanife commented on GitHub (Oct 21, 2024):

I can confirm that I am not getting this error anymore and the model loads, including the llama3.1:405b-instruct-q2_K you ran.
However, all large (I think all multi-GPU) models generate total rubbish.
In any event, this is another issue, so it makes sense to close this one.


@farshadghodsian commented on GitHub (Oct 21, 2024):

> @dhiltgen Latest update 0.3.14 fixed the crash I had for large models. Should I close the issue? @vanife Does it work in your case?

I can also confirm that the latest update fixes a similar issue I had with a segmentation fault when trying to load large models (Llama3.1-405B-q4). After updating to 0.3.14 today I can now successfully load and run Llama3.1-405b on 5x W7900 GPUs. It runs a bit slow, but I'm happy that it finally works, as I've been troubleshooting this issue for the last few days.

Error seen on Ollama 0.3.13:
![Screenshot 2024-10-21 152440](https://github.com/user-attachments/assets/8e99bc8c-c44c-4f00-bf8a-1d03c91ac780)

Fixed after updating to Ollama 0.3.14 with no other change:
![Screenshot 2024-10-21 153220](https://github.com/user-attachments/assets/2132b432-82f8-42ef-b2b9-b48074a27880)


@MikeLP commented on GitHub (Oct 22, 2024):

> ...However, all large (I think all multi-GPU) models generate total rubbish.

Interesting - in my case the output is good.


@farshadghodsian commented on GitHub (Oct 22, 2024):

> I can confirm that I am not getting this error anymore and the model loads, including the llama3.1:405b-instruct-q2_K you ran. However, all large (I think all multi-GPU) models generate total rubbish. In any event, this is another issue, so it makes sense to close this one.

If you are on Linux and having issues with Ollama generating gibberish in a multi-GPU setup with AMD GPUs, it could be that you need to enable IOMMU passthrough in your GRUB config (see https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/install-faq.html#issue-5-application-hangs-on-multi-gpu-systems). I've experienced it with earlier versions of ROCm, and doing that fixed all my multi-GPU issues.
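For reference, a hedged sketch of how that kernel parameter is typically applied on a GRUB-based distro (the exact file and regeneration command vary by distribution; back up the config first):

```
# Edit /etc/default/grub and append iommu=pt to the kernel command line, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=pt"
sudo nano /etc/default/grub

# Regenerate the GRUB config and reboot (on Fedora/RHEL use grub2-mkconfig instead).
sudo update-grub && sudo reboot

# After the reboot, confirm the kernel picked up the parameter:
grep -o 'iommu=pt' /proc/cmdline
```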


@vanife commented on GitHub (Oct 22, 2024):

> > I can confirm that I am not getting this error anymore and the model loads, including the llama3.1:405b-instruct-q2_K you ran. However, all large (I think all multi-GPU) models generate total rubbish. In any event, this is another issue, so it makes sense to close this one.
>
> If you are on Linux and having issues with Ollama generating gibberish in a multi-GPU setup with AMD GPUs, it could be that you need to enable IOMMU passthrough in your GRUB config (see https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/install-faq.html#issue-5-application-hangs-on-multi-gpu-systems). I've experienced it with earlier versions of ROCm, and doing that fixed all my multi-GPU issues.

Thank you for the hint. I tried it, but it did not help at all.
Previously I had an issue when the model crossed ~64 GB in size (see my comments above), but now I have tried various models and configurations, and as soon as a model does not fit in 1 GPU it produces total garbage, whereas before it worked for models that fit into a few GPUs.

I have 11x AMD Radeon Pro VII with 16 GB VRAM each. I wonder if support for them has somehow been removed now, as they are quite dated...

But again - no segmentation fault error anymore, so this must be a different issue now, so this could be closed from my perspective.


@dhiltgen commented on GitHub (Oct 22, 2024):

@vanife if you force the system to under-allocate layers by setting OLLAMA_GPU_OVERHEAD does that yield good output across your multiple GPUs, or is the gibberish behavior consistent even with under-allocation?

How much system memory do you have? From what I understand, AMD recommends your system memory be larger than VRAM for reliable ROCm behavior.
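A minimal sketch of the under-allocation experiment being suggested; to my understanding OLLAMA_GPU_OVERHEAD is a per-GPU reservation expressed in bytes, and the 2 GiB value here is only an illustrative starting point:

```
# Stop the systemd-managed server so the variable reaches the server process.
sudo systemctl stop ollama

# Reserve ~2 GiB of headroom on every GPU so fewer layers are offloaded.
OLLAMA_GPU_OVERHEAD=2147483648 ollama serve
```

If the gibberish persists even with clearly under-allocated GPUs, a layer-prediction overshoot is unlikely to be the cause.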


@vanife commented on GitHub (Oct 22, 2024):

> @vanife if you force the system to under-allocate layers by setting OLLAMA_GPU_OVERHEAD does that yield good output across your multiple GPUs, or is the gibberish behavior consistent even with under-allocation?
>
> How much system memory do you have? From what I understand, AMD recommends your system memory be larger than VRAM for reliable ROCm behavior.

I have just 64 GB RAM (and this seems to have been my suspicion in my prior comments, where if the total VRAM need was around this number, it failed to load with a "segmentation fault" error). Search for the text "Therefore, I have a strong suspicion that something happens around ~64 GB" to find my comment above on this topic.

Now (using version 0.3.14) I do not get a "segmentation fault" anymore, but all models, even 20-30 GB ones, produce complete garbage as soon as the model does not fit into 1 GPU.

I do not think that setting OLLAMA_GPU_OVERHEAD will change anything, as loading llama3.1:70b-instruct-q5_K_M spreads nicely across GPUs (25-30% VRAM usage across 11, ~60% across 5 and ~73% across 4 GPUs), leaving plenty of free space.


@vanife commented on GitHub (Oct 22, 2024):

OK, so the hypothesis that the problem is strictly related to multi-GPU is not correct.

On my PC I have 11 AMD Radeon Pro VII cards (and 2 more NVIDIA cards, which are disabled for LLM usage), but I use PCIe riser/splitters to connect them, as my mobo does not have that many PCIe slots.

I have now enabled only 1 card per splitter using HIP_VISIBLE_DEVICES, and Ollama loads without any problem on 3-4 GPUs as long as I only use one GPU per riser/splitter.
In this setup I can only use 4 GPUs, as I have only 4 PCIe slots directly on the mobo, which limits total usable GPU VRAM to 64 GB (4x16), which coincidentally is also my RAM limit. So my prior issue that resulted in a segmentation fault could also be related to this setup.

I do not yet understand why this is the case, but it works for llama3.1:70b-instruct-q5_K_M without any problem.
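A sketch of that kind of device filtering, assuming for illustration that the directly attached cards enumerate as ROCm devices 0, 2, 5 and 8 (check the actual numbering on your box; the rocm-smi flag name may differ between releases):

```
# Map ROCm device indices to PCI bus IDs to work out which card sits on which riser.
rocm-smi --showbus

# Expose only one Radeon Pro VII per physical PCIe slot to the Ollama runner.
sudo systemctl stop ollama
HIP_VISIBLE_DEVICES=0,2,5,8 ollama serve
```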

Question: any idea why this might be the case? Below is an extract from lspci output:

```
--00.0-[03]--+-00.0 [Radeon Pro VII]

+-00.0-[07-13]----00.0-[08-13]--+-02.0-[0a-0c]----00.0-[0b-0c]----00.0-[0c]--+-00.0 [Radeon Pro VII]
|                               +-03.0-[0d-0f]----00.0-[0e-0f]----00.0-[0f]--+-00.0 [Radeon Pro VII]
|                               +-06.0-[12]--+-00.0  NVIDIA Corporation AD104 [GeForce RTX 4070 SUPER]
|                               \-07.0-[13]--+-00.0  NVIDIA Corporation AD104 [GeForce RTX 4070 SUPER]

+-08.0-[14-21]----00.0-[15-21]--+-01.0-[16-18]----00.0-[17-18]----00.0-[18]--+-00.0 [Radeon Pro VII]
|                               +-03.0-[19-1b]----00.0-[1a-1b]----00.0-[1b]--+-00.0 [Radeon Pro VII]
|                               +-05.0-[1c-1e]----00.0-[1d-1e]----00.0-[1e]--+-00.0 [Radeon Pro VII]
|                               \-07.0-[1f-21]----00.0-[20-21]----00.0-[21]--+-00.0 [Radeon Pro VII]

+-09.0-[22-32]----00.0-[23-32]--+-02.0-[25-27]----00.0-[26-27]----00.0-[27]--+-00.0 [Radeon Pro VII]
|                               +-03.0-[28-2a]----00.0-[29-2a]----00.0-[2a]--+-00.0 [Radeon Pro VII]
|                               +-06.0-[2d-2f]----00.0-[2e-2f]----00.0-[2f]--+-00.0 [Radeon Pro VII]
|                               \-07.0-[30-32]----00.0-[31-32]----00.0-[32]--+-00.0 [Radeon Pro VII]
```

@dhiltgen commented on GitHub (Oct 22, 2024):

The change we made in 0.3.14 that resolved the problem for most everyone was to stop enabling a workaround in llama.cpp related to GGML_CUDA_NO_PEER_COPY - origin: https://github.com/ggerganov/llama.cpp/commit/2f0e81e053b41ca28e73a841e7bdbf9820baaa57

This was introduced to work around problems with multiple GPUs doing P2P copies between them.

I believe setting NCCL_P2P_DISABLE=1 on the server may accomplish the same thing at runtime. @vanife can you try that and see if it helps?
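A minimal sketch of setting that variable persistently for the systemd-managed server; whether the ROCm backend actually honors NCCL_P2P_DISABLE at runtime is exactly what this would test:

```
# Create or open a drop-in override for the Ollama unit.
sudo systemctl edit ollama.service
# In the editor that opens, add:
#   [Service]
#   Environment="NCCL_P2P_DISABLE=1"

# Reload the unit definitions and restart the server so the variable is picked up.
sudo systemctl daemon-reload
sudo systemctl restart ollama
```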


@vanife commented on GitHub (Oct 23, 2024):

> The change we made in 0.3.14 that resolved the problem for most everyone was to stop enabling a workaround in llama.cpp related to GGML_CUDA_NO_PEER_COPY - origin: ggerganov/llama.cpp@2f0e81e
>
> This was introduced to work around problems with multiple GPUs doing P2P copies between them.
>
> I believe setting NCCL_P2P_DISABLE=1 on the server may accomplish the same thing at runtime. @vanife can you try that and see if it helps?

Thank you for the additional ideas, but I do not see how setting this flag changes anything; I do not see it being recognized by the server at runtime. I tried both NCCL_P2P_DISABLE=1 and GGML_CUDA_NO_PEER_COPY=1 in the ollama.service config, but the logging does not show either of them being used, and there is no change from my previous results.

I am pretty sure it has something to do with my PCIe splitters; here is why:

  • I have 11 GPUs on a "consumer" mobo, so I use splitters (like this one: https://www.aliexpress.com/item/1005006779220914.html)
  • When I do not disable any GPU, the model loads, but produces garbage.
  • When I disable GPUs so that only 4 are available (0,1,2,3 of 11, using various combinations), the result is the same - garbage.
  • When I enable only 1 GPU per original PCIe slot on the board (1 connected directly to the mobo + 3 via splitters, but only 1 per splitter/mobo PCIe slot), the model loads and works.

Hence I conclude that the issue is somehow related to the fact that I use PCIe splitters (which are made for mining, which I do not do, but I do other calculations that work well in such a setup). But it would be great to understand the actual reason for this.


Again, I think this issue should be closed, as the "segmentation fault" is no longer generated since 0.3.14.


@saman-amd commented on GitHub (Oct 23, 2024):

Hey @vanife,
GGML_CUDA_NO_PEER_COPY=1 is a build flag which needs to be used at compile time, so it will not make any changes at runtime. Ollama 0.3.13 had this flag set; the segmentation fault would have been caused if RAM < Model_Size < VRAM.
So since you have 64 GB RAM, I guess if you use version 0.3.13 with a model < 64 GB you shouldn't experience the seg fault. Maybe you could rule that scenario out by checking whether you see the same garbage output with 0.3.13 and a model size < 64 GB.
I was wondering if you've ever tried running this on the 2x NVIDIA GPUs that you have on your splitter? Does the same thing happen?
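For context on what "build flag" means here, a hedged sketch of how such an option would typically be passed when compiling llama.cpp standalone for ROCm; the option names (GGML_HIPBLAS, GGML_CUDA_NO_PEER_COPY) are assumptions based on the llama.cpp CMake of roughly that era, not Ollama's own build scripts, so verify they exist in your checkout:

```
# Hypothetical standalone llama.cpp build re-enabling the old peer-copy workaround.
cmake -B build -DGGML_HIPBLAS=ON -DGGML_CUDA_NO_PEER_COPY=ON
cmake --build build --config Release -j"$(nproc)"
```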


@dhiltgen commented on GitHub (Oct 23, 2024):

@vanife agreed, we can close this one now. I'm just hoping we can find the root cause and a viable solution for your setup as well, and document it for others if it turns out to be env flags that do the trick.

Ollama doesn't implement this P2P env var, but I believe one of the underlying libraries we use does. What I've been told is that direct GPU <--> GPU copy works only under certain conditions, like both GPUs:

  • should be under the same PCI root port
  • Large BAR enabled
  • IOMMU disabled

So our hope is that by setting this env var at runtime, you'll accomplish roughly the same behavior as the build-time setting we no longer set, and disable P2P copy, which in your setup may not be possible.
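A sketch of how those three conditions can be checked from a shell; output formats vary by kernel and board, and the grep patterns below are only rough filters:

```
# 1. Same PCI root port: show the PCIe tree and check whether the two GPUs hang
#    off the same root port / upstream switch.
lspci -tv | less

# 2. Large (resizable) BAR: inspect the BAR sizes reported for each display device.
sudo lspci -vvv -d ::0300 | grep -A3 'Region 0'

# 3. IOMMU state: check the kernel command line and boot messages.
cat /proc/cmdline
sudo dmesg | grep -iE 'iommu|AMD-Vi'
```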


@vanife commented on GitHub (Oct 24, 2024):

> I was wondering if you've ever tried running this on the 2x NVIDIA GPUs that you have on your splitter? Does the same thing happen?

Short answer: NO.

I ran a few more tests and can confirm the following: when I enable only the 2 NVIDIA "GeForce RTX 4070 SUPER" (12 GB VRAM) GPUs (which also sit on one PCIe splitter, shared with 2 more "Radeon Pro VII" GPUs), Ollama models load on both and work properly - no error and no garbage output. I have tested the following models [name, ls size, ps size, split]:

| Model                        | Size (ls) | Size (ps) | Comment                  |
|------------------------------|-----------|-----------|--------------------------|
| llama3.1:70b-instruct-q5_K_M | 49 GB     | 55 GB     | 56%/44% CPU/GPU: 2x4070S |
| qwen2:7b-instruct-fp16       | 15 GB     | 17 GB     | 100% GPU                 |
| llama3.1:8b-instruct-fp16    | 14 GB     | 18 GB     | 100% GPU                 |

So it appears to be limited to AMD cards. I am not sure why this is the case, and would love to find out why and how it can be solved. I would buy a more robust system (mobo + CPU + RAM) and offload some of my Pro VIIs there, but first I would like to understand whether these GPUs even make sense to use for local LLMs. Happy to explore any other ideas.


@vanife commented on GitHub (Oct 24, 2024):

> @vanife agreed, we can close this one now. I'm just hoping we can find the root cause and a viable solution for your setup as well, and document it for others if it turns out to be env flags that do the trick.
>
> Ollama doesn't implement this P2P env var, but I believe one of the underlying libraries we use does. What I've been told is that direct GPU <--> GPU copy works only under certain conditions, like both GPUs:
>
>   • should be under the same PCI root port
>   • Large BAR enabled
>   • IOMMU disabled
>
> So our hope is that by setting this env var at runtime, you'll accomplish roughly the same behavior as the build-time setting we no longer set, and disable P2P copy, which in your setup may not be possible.

  • What is the "same PCI root port"? How do I make sure this is the case?
  • Large BAR: this is set in the BIOS.
  • IOMMU disabled: I set iommu=pt as per the previous suggestion, but no change (I also tried disabling it completely in the BIOS - also no change).
  • I did set Environment="NCCL_P2P_DISABLE=1" in the Ollama server config, but no change.

@vanife commented on GitHub (Oct 24, 2024):

I am wondering if it could be related to the fact that the Radeon Pro VII is not actively supported, and some now-required features could be the reason for the failure. Extract from https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html:

[compatibility matrix screenshot](https://github.com/user-attachments/assets/c827a1ac-8b79-4bcb-a944-f44903a5f2bd)

@vanife commented on GitHub (Oct 24, 2024):

> Hence I conclude that the issue is somehow related to the fact that I use PCIe splitters (which are made for mining, which I do not do, but I do other calculations that work well in such a setup). But it would be great to understand the actual reason for this.

And just to validate this hypothesis further, I also added an M.2-to-GPU riser, which allowed me to connect one more GPU "directly" to the mobo so that it is not sharing anything via a splitter (or rather, it is another splitter for just one GPU).
Outcome? => Now I can run larger models on 5 (previously 4) GPUs, at a maximum of 1 per PCIe/M.2 slot. But as soon as I use even 2 cards on the same PCIe splitter => I get rubbish as the result.

Hence the PCIe splitters are the cause. I have different brands/types and I tried them all - same result.


It would still be nice to know the "real" reason, not just the situational one.


@MikeLP commented on GitHub (Oct 26, 2024):

@vanife I don't know if it will help you, but there are some limitations explained here:
https://rocm.docs.amd.com/projects/radeon/en/docs-6.1.3/docs/install/native_linux/mgpu.html

> Hardware
> ✓ - GPU0 PCIe x16 connection + GPU1 PCIe x16 connection
> ✓ - GPU0 PCIe x8 connection + GPU1 PCIe x8 connection
> X - GPU0 PCIe x16 connection + GPU1 PCIe x8 connection

And

> Important!
> Only use PCIe slots connected by the CPU and avoid PCIe slots connected via the chipset. Refer to product-specific motherboard documentation for PCIe electrical configuration.
> Ensure the system Power Supply Unit (PSU) has sufficient wattage to support multiple GPUs.


@MikeLP commented on GitHub (Nov 8, 2024):

@dhiltgen Sorry to disturb, but I've encountered a new error specific to version 0.4.0 - 'llama runner process has terminated: exit status 127'.

I'm seeing this issue with all model sizes regardless of VRAM size, and the logs indicate that VRAM can't recover. Notably, version 0.3.14 works perfectly fine, so it doesn't seem to be a hardware issue. Should I open a new issue since it's related to ROCm, or is this already a known issue like this one (https://github.com/ollama/ollama/issues/7542)?


@vanife commented on GitHub (Nov 9, 2024):

I can also confirm a new issue where 0.3.14 works (4x AMD GPU setup on Linux), but it fails on both 0.4.0 and 0.4.1 with "Error: llama runner process has terminated: error loading model: unable to allocate backend buffer".


@dhiltgen commented on GitHub (Nov 13, 2024):

@MikeLP @vanife please check your server logs to see what caused the runner to crash. If you don't see any other recent issues that have the same failure, please file a new issue.


@dhiltgen commented on GitHub (Feb 25, 2025):

Is this still a problem with the latest versions? I'm trying to determine if #7378 is still useful.


@vanife commented on GitHub (Feb 26, 2025):

> Is this still a problem with the latest versions? I'm trying to determine if #7378 is still useful.

I cannot say whether #7378 is useful, but I have not experienced crashes for some time now, with or without it being set. Thank you.

Reference: github-starred/ollama#66190