[GH-ISSUE #6595] 4 AMD GPUs with mixed VRAM sizes: incorrect layer predictions lead to runner crash #66190

Closed
opened 2026-05-04 00:31:46 -05:00 by GiteaMirror · 43 comments

Originally created by @MikeLP on GitHub (Sep 2, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6595

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

When I load a large model that doesn't fit in VRAM, Ollama crashes:

➜ ~ ollama run dbrx:132b-instruct-q8_0
Error: llama runner process has terminated: signal: segmentation fault (core dumped)

This issue does not occur with Ollama 0.3.6.

My hardware:
CPU: AMD Ryzen Threadripper PRO 7965WX 24-Cores
GPU 1: AMD Instinct MI100 [Discrete]
GPU 2: AMD Instinct MI100 [Discrete]
GPU 3: AMD Radeon RX 6900 XT [Discrete]
GPU 4: AMD Radeon VII [Discrete]
VRAM: 96GiB
RAM: 128 GiB
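
On a mixed AMD setup like this, the per-GPU VRAM totals can be double-checked with rocm-smi; a minimal sketch, assuming the ROCm utilities are installed:

rocm-smi --showmeminfo vram    # total/used VRAM for each ROCm-visible GPU
rocm-smi --showproductname     # map device indices to card names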

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.3.7 - 0.3.9

GiteaMirror added the amd, memory, bug labels 2026-05-04 00:31:48 -05:00

@jmorganca commented on GitHub (Sep 2, 2024):

Thanks for the issue!


@dhiltgen commented on GitHub (Sep 3, 2024):

Can you share your server log? (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md)
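
On a systemd-based Linux install, the server log can usually be captured with journalctl; a minimal sketch, assuming the default ollama.service unit name from the install script:

journalctl -u ollama -f                          # follow the live server log
journalctl -u ollama --no-pager > server.log     # dump the recent log to a file to attach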


@MikeLP commented on GitHub (Sep 3, 2024):

@dhiltgen

Sep 03 12:59:56 iLinux systemd[1]: Started ollama.service - Ollama Service.
Sep 03 12:59:56 iLinux ollama[1114471]: 2024/09/03 12:59:56 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
Sep 03 12:59:56 iLinux ollama[1114471]: time=2024-09-03T12:59:56.539-07:00 level=INFO source=images.go:753 msg="total blobs: 155"
Sep 03 12:59:56 iLinux ollama[1114471]: time=2024-09-03T12:59:56.541-07:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
Sep 03 12:59:56 iLinux ollama[1114471]: time=2024-09-03T12:59:56.542-07:00 level=INFO source=routes.go:1172 msg="Listening on 127.0.0.1:11434 (version 0.3.9)"
Sep 03 12:59:56 iLinux ollama[1114471]: time=2024-09-03T12:59:56.542-07:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama457793794/runners
Sep 03 13:00:01 iLinux ollama[1114471]: time=2024-09-03T13:00:01.601-07:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cuda_v12 rocm_v60102 cpu cpu_avx cpu_avx2 cuda_v11]"
Sep 03 13:00:01 iLinux ollama[1114471]: time=2024-09-03T13:00:01.601-07:00 level=INFO source=gpu.go:200 msg="looking for compatible GPUs"
Sep 03 13:00:01 iLinux ollama[1114471]: time=2024-09-03T13:00:01.608-07:00 level=INFO source=amd_linux.go:345 msg="amdgpu is supported" gpu=0 gpu_type=gfx908
Sep 03 13:00:01 iLinux ollama[1114471]: time=2024-09-03T13:00:01.608-07:00 level=INFO source=amd_linux.go:345 msg="amdgpu is supported" gpu=1 gpu_type=gfx908
Sep 03 13:00:01 iLinux ollama[1114471]: time=2024-09-03T13:00:01.608-07:00 level=INFO source=amd_linux.go:345 msg="amdgpu is supported" gpu=2 gpu_type=gfx1030
Sep 03 13:00:01 iLinux ollama[1114471]: time=2024-09-03T13:00:01.609-07:00 level=INFO source=amd_linux.go:345 msg="amdgpu is supported" gpu=3 gpu_type=gfx906
Sep 03 13:00:01 iLinux ollama[1114471]: time=2024-09-03T13:00:01.609-07:00 level=INFO source=types.go:107 msg="inference compute" id=0 library=rocm variant="" compute=gfx908 driver=6.8 name=1002:738c total="32.0 GiB" available="32.0 GiB"
Sep 03 13:00:01 iLinux ollama[1114471]: time=2024-09-03T13:00:01.609-07:00 level=INFO source=types.go:107 msg="inference compute" id=1 library=rocm variant="" compute=gfx908 driver=6.8 name=1002:738c total="32.0 GiB" available="32.0 GiB"
Sep 03 13:00:01 iLinux ollama[1114471]: time=2024-09-03T13:00:01.609-07:00 level=INFO source=types.go:107 msg="inference compute" id=2 library=rocm variant="" compute=gfx1030 driver=6.8 name=1002:73bf total="16.0 GiB" available="12.0 GiB"
Sep 03 13:00:01 iLinux ollama[1114471]: time=2024-09-03T13:00:01.609-07:00 level=INFO source=types.go:107 msg="inference compute" id=3 library=rocm variant="" compute=gfx906 driver=6.8 name=1002:66af total="16.0 GiB" available="16.0 GiB"
Sep 03 13:00:46 iLinux ollama[1114471]: [GIN] 2024/09/03 - 13:00:46 | 200 |       33.22µs |       127.0.0.1 | HEAD     "/"
Sep 03 13:00:46 iLinux ollama[1114471]: [GIN] 2024/09/03 - 13:00:46 | 200 |   23.079446ms |       127.0.0.1 | POST     "/api/show"
Sep 03 13:00:46 iLinux ollama[1114471]: time=2024-09-03T13:00:46.285-07:00 level=INFO source=memory.go:309 msg="offload to rocm" layers.requested=-1 layers.model=65 layers.offload=45 layers.split=17,17,4,7 memory.available="[32.0 GiB 32.0 GiB 12.0 GiB 16.0 GiB]" memory.required.full="122.9 GiB" memory.required.partial="90.1 GiB" memory.required.kv="512.0 MiB" memory.required.allocations="[31.5 GiB 31.5 GiB 11.2 GiB 15.9 GiB]" memory.weights.total="100.1 GiB" memory.weights.repeating="97.0 GiB" memory.weights.nonrepeating="3.1 GiB" memory.graph.full="2.9 GiB" memory.graph.partial="2.9 GiB"
Sep 03 13:00:46 iLinux ollama[1114471]: time=2024-09-03T13:00:46.287-07:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama457793794/runners/rocm_v60102/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-4912a1576bf1e5814a568b5bfa497aa68e010112543291e04d0ab395a2daeff8 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 45 --no-mmap --parallel 1 --tensor-split 17,17,4,7 --port 34797"
Sep 03 13:00:46 iLinux ollama[1114471]: time=2024-09-03T13:00:46.287-07:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Sep 03 13:00:46 iLinux ollama[1114471]: time=2024-09-03T13:00:46.287-07:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 03 13:00:46 iLinux ollama[1114471]: time=2024-09-03T13:00:46.287-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 03 13:00:46 iLinux ollama[1118588]: INFO [main] build info | build=1 commit="1e6f655" tid="132514998158144" timestamp=1725393646
Sep 03 13:00:46 iLinux ollama[1118588]: INFO [main] system info | n_threads=24 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="132514998158144" timestamp=1725393646 total_threads=48
Sep 03 13:00:46 iLinux ollama[1118588]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="47" port="34797" tid="132514998158144" timestamp=1725393646
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: loaded meta data with 34 key-value pairs and 642 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-4912a1576bf1e5814a568b5bfa497aa68e010112543291e04d0ab395a2daeff8 (version GGUF V3 (latest))
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv   0:                       general.architecture str              = command-r
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv   2:                               general.name str              = C4Ai Command R Plus 08 2024
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv   3:                            general.version str              = 08-2024
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv   4:                           general.basename str              = c4ai-command-r-plus
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv   5:                         general.size_label str              = 104B
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv   6:                            general.license str              = cc-by-nc-4.0
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv   7:                          general.languages arr[str,10]      = ["en", "fr", "de", "es", "it", "pt", ...
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv   8:                      command-r.block_count u32              = 64
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv   9:                   command-r.context_length u32              = 131072
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  10:                 command-r.embedding_length u32              = 12288
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  11:              command-r.feed_forward_length u32              = 33792
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  12:             command-r.attention.head_count u32              = 96
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  13:          command-r.attention.head_count_kv u32              = 8
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  14:                   command-r.rope.freq_base f32              = 8000000.000000
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  15:     command-r.attention.layer_norm_epsilon f32              = 0.000010
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  16:                          general.file_type u32              = 7
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  17:                      command-r.logit_scale f32              = 0.833333
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  18:                command-r.rope.scaling.type str              = none
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = command-r
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,256000]  = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,253333]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 5
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 255001
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = true
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  28:               tokenizer.ggml.add_eos_token bool             = false
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  29:           tokenizer.chat_template.tool_use str              = {{ bos_token }}{% if messages[0]['rol...
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  30:                tokenizer.chat_template.rag str              = {{ bos_token }}{% if messages[0]['rol...
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  31:                   tokenizer.chat_templates arr[str,2]       = ["rag", "tool_use"]
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - kv  33:               general.quantization_version u32              = 2
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - type  f32:  193 tensors
Sep 03 13:00:46 iLinux ollama[1114471]: llama_model_loader: - type q8_0:  449 tensors
Sep 03 13:00:46 iLinux ollama[1114471]: time=2024-09-03T13:00:46.539-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_vocab: special tokens cache size = 37
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_vocab: token to piece cache size = 1.8426 MB
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: arch             = command-r
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: vocab type       = BPE
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_vocab          = 256000
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_merges         = 253333
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: vocab_only       = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_ctx_train      = 131072
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_embd           = 12288
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_layer          = 64
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_head           = 96
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_head_kv        = 8
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_rot            = 128
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_swa            = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_embd_head_k    = 128
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_embd_head_v    = 128
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_gqa            = 12
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_embd_k_gqa     = 1024
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_embd_v_gqa     = 1024
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: f_norm_eps       = 1.0e-05
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: f_logit_scale    = 8.3e-01
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_ff             = 33792
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_expert         = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_expert_used    = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: causal attn      = 1
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: pooling type     = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: rope type        = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: rope scaling     = none
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: freq_base_train  = 8000000.0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: freq_scale_train = 1
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: rope_finetuned   = unknown
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: ssm_d_conv       = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: ssm_d_inner      = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: ssm_d_state      = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: model type       = ?B
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: model ftype      = Q8_0
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: model params     = 103.81 B
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: model size       = 102.73 GiB (8.50 BPW)
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: general.name     = C4Ai Command R Plus 08 2024
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: BOS token        = 5 '<BOS_TOKEN>'
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: EOS token        = 255001 '<|END_OF_TURN_TOKEN|>'
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: PAD token        = 0 '<PAD>'
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: LF token         = 136 'Ä'
Sep 03 13:00:46 iLinux ollama[1114471]: llm_load_print_meta: max token length = 1024
Sep 03 13:00:47 iLinux ollama[1114471]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 03 13:00:47 iLinux ollama[1114471]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 03 13:00:47 iLinux ollama[1114471]: ggml_cuda_init: found 4 ROCm devices:
Sep 03 13:00:47 iLinux ollama[1114471]:   Device 0: AMD Instinct MI100, compute capability 9.0, VMM: no
Sep 03 13:00:47 iLinux ollama[1114471]:   Device 1: AMD Instinct MI100, compute capability 9.0, VMM: no
Sep 03 13:00:47 iLinux ollama[1114471]:   Device 2: AMD Radeon RX 6900 XT, compute capability 10.3, VMM: no
Sep 03 13:00:47 iLinux ollama[1114471]:   Device 3: AMD Radeon VII, compute capability 9.0, VMM: no
Sep 03 13:00:47 iLinux ollama[1114471]: llm_load_tensors: ggml ctx size =    1.47 MiB
Sep 03 13:00:49 iLinux ollama[1114471]: time=2024-09-03T13:00:49.245-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 03 13:00:53 iLinux ollama[1114471]: time=2024-09-03T13:00:53.569-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 03 13:00:55 iLinux ollama[1114471]: time=2024-09-03T13:00:55.022-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 03 13:00:58 iLinux ollama[1114471]: time=2024-09-03T13:00:58.957-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 03 13:00:59 iLinux ollama[1114471]: llm_load_tensors: offloading 45 repeating layers to GPU
Sep 03 13:00:59 iLinux ollama[1114471]: llm_load_tensors: offloaded 45/65 layers to GPU
Sep 03 13:00:59 iLinux ollama[1114471]: llm_load_tensors:      ROCm0 buffer size = 27095.41 MiB
Sep 03 13:00:59 iLinux ollama[1114471]: llm_load_tensors:      ROCm1 buffer size = 27095.41 MiB
Sep 03 13:00:59 iLinux ollama[1114471]: llm_load_tensors:      ROCm2 buffer size =  6375.39 MiB
Sep 03 13:00:59 iLinux ollama[1114471]: llm_load_tensors:      ROCm3 buffer size = 11156.93 MiB
Sep 03 13:00:59 iLinux ollama[1114471]: llm_load_tensors:  ROCm_Host buffer size = 36658.15 MiB
Sep 03 13:01:00 iLinux ollama[1114471]: time=2024-09-03T13:01:00.160-07:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 03 13:01:00 iLinux ollama[1114471]: time=2024-09-03T13:01:00.411-07:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: segmentation fault (core dumped)"
Sep 03 13:01:00 iLinux ollama[1114471]: [GIN] 2024/09/03 - 13:01:00 | 500 | 14.178486292s |       127.0.0.1 | POST     "/api/chat"
Sep 03 13:01:05 iLinux ollama[1114471]: time=2024-09-03T13:01:05.412-07:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.001075302 model=/usr/share/ollama/.ollama/models/blobs/sha256-4912a1576bf1e5814a568b5bfa497aa68e010112543291e04d0ab395a2daeff8
Sep 03 13:01:05 iLinux ollama[1114471]: time=2024-09-03T13:01:05.662-07:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.250584182 model=/usr/share/ollama/.ollama/models/blobs/sha256-4912a1576bf1e5814a568b5bfa497aa68e010112543291e04d0ab395a2daeff8
Sep 03 13:01:05 iLinux ollama[1114471]: time=2024-09-03T13:01:05.912-07:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.501113953 model=/usr/share/ollama/.ollama/models/blobs/sha256-4912a1576bf1e5814a568b5bfa497aa68e010112543291e04d0ab395a2daeff8

@vanife commented on GitHub (Sep 4, 2024):

@dhiltgen

Sep 03 12:59:56 iLinux systemd[1]: Started ollama.service - Ollama Service.
Sep 03 12:59:56 iLinux ollama[1114471]: 2024/09/03 12:59:56 routes.go:1125: INFO server config
....
....
....

msg="waiting for server to become available" status="llm server not responding"

Sep 03 13:01:00 iLinux ollama[1114471]: time=2024-09-03T13:01:00.411-07:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: segmentation fault (core dumped)"
Sep 03 13:01:00 iLinux ollama[1114471]: [GIN] 2024/09/03 - 13:01:00 | 500 | 14.178486292s | 127.0.0.1 | POST "/api/chat"
Sep 03 13:01:05 iLinux ollama[1114471]: time=2024-09-03T13:01:05.412-07:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.001075302 model=/usr/share/ollama/.ollama/models/blobs/sha256-4912a1576bf1e5814a568b5bfa497aa68e010112543291e04d0ab395a2daeff8
Sep 03 13:01:05 iLinux ollama[1114471]: time=2024-09-03T13:01:05.662-07:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.250584182 model=/usr/share/ollama/.ollama/models/blobs/sha256-4912a1576bf1e5814a568b5bfa497aa68e010112543291e04d0ab395a2daeff8
Sep 03 13:01:05 iLinux ollama[1114471]: time=2024-09-03T13:01:05.912-07:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.501113953 model=/usr/share/ollama/.ollama/models/blobs/sha256-4912a1576bf1e5814a568b5bfa497aa68e010112543291e04d0ab395a2daeff8

I have a similar problem with 6 AMD Radeon Pro VII cards for larger models.
I am wondering if this is because my RAM (not VRAM) is only 64 GB. Does it need to be larger than the VRAM requirement of the model?


@dhiltgen commented on GitHub (Sep 4, 2024):

@MikeLP as a workaround, are you able to reduce the number of layers loaded via num_gpu to get it to load, and if so, how much did we overshoot?

@vanife for large models split between GPU and CPU, yes, it can require significant system memory. Usually crashes related to host memory will report as such in the logs, but not always. If you run with OLLAMA_DEBUG=1 set on the server, it will report more information about system memory, free swap space, etc., which may help identify the cause of a crash.
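
As a rough sketch of both suggestions (the layer count 40 below is only an arbitrary illustrative value, not a measured recommendation):

# Cap the number of offloaded layers for a single request via the num_gpu option
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "dbrx:132b-instruct-q8_0",
  "prompt": "hello",
  "options": { "num_gpu": 40 }
}'

# The same parameter can be set interactively inside `ollama run` with:
#   /set parameter num_gpu 40

# Enable debug logging for a systemd-managed server by adding an override, then restart:
#   sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama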


@vanife commented on GitHub (Sep 4, 2024):

> @MikeLP as a workaround, are you able to reduce the number of layers loaded via num_gpu to get it to load, and if so, how much did we overshoot?
>
> @vanife for large models split between GPU and CPU, yes, it can require significant system memory. Usually crashes related to host memory will report as such in the logs, but not always. If you run with OLLAMA_DEBUG=1 set on the server, it will report more information about system memory, free swap space, etc., which may help identify the cause of a crash.

Thank you, @dhiltgen.
I have 96GB total VRAM (6x 16GB).

For me these do work: qwen2:72b (41 GB) and llama3.1:70b (39 GB, about 58% of the 6 GPUs). But llama3.1:70b-instruct-q5_K_M (49 GB) no longer loads, even though there is clearly enough VRAM to hold the whole model (and the 64 GB of RAM should also be sufficient, I think).

This is the output from running the following command (on Ubuntu 22.04): OLLAMA_DEBUG=1 ollama run llama3.1:70b-instruct-q5_K_M (which should require ~49 GB of VRAM out of my 96 GB available):

Sep 04 18:45:51 mypc ollama[136853]: [GIN] 2024/09/04 - 18:45:51 | 200 |      19.045µs |       127.0.0.1 | HEAD     "/"
Sep 04 18:45:51 mypc ollama[136853]: [GIN] 2024/09/04 - 18:45:51 | 200 |   11.827211ms |       127.0.0.1 | POST     "/api/show"
Sep 04 18:45:51 mypc ollama[136853]: time=2024-09-04T18:45:51.706+02:00 level=INFO source=sched.go:731 msg="new model will fit in available VRAM, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-f8f84c9d64218d440bcee215e1d298a64cb4fde44df2c0b4791482bb16152ebd library=rocm parallel=4 required="61.4 GiB"
Sep 04 18:45:51 mypc ollama[136853]: time=2024-09-04T18:45:51.706+02:00 level=INFO source=memory.go:309 msg="offload to rocm" layers.requested=-1 layers.model=81 layers.offload=81 layers.split=14,14,14,13,13,13 memory.available="[15.0 GiB 15.0 GiB 14.9 GiB 14.9 GiB 14.9 GiB 14.9 GiB]" memory.required.full="61.4 GiB" memory.required.partial="61.4 GiB" memory.required.kv="2.5 GiB" memory.required.allocations="[10.7 GiB 10.4 GiB 10.6 GiB 10.1 GiB 9.8 GiB 9.8 GiB]" memory.weights.total="47.5 GiB" memory.weights.repeating="46.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
Sep 04 18:45:51 mypc ollama[136853]: time=2024-09-04T18:45:51.708+02:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama3065803334/runners/rocm_v60102/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-f8f84c9d64218d440bcee215e1d298a64cb4fde44df2c0b4791482bb16152ebd --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 81 --no-mmap --parallel 4 --tensor-split 14,14,14,13,13,13 --port 36599"
Sep 04 18:45:51 mypc ollama[136853]: time=2024-09-04T18:45:51.708+02:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Sep 04 18:45:51 mypc ollama[136853]: time=2024-09-04T18:45:51.708+02:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 04 18:45:51 mypc ollama[136853]: time=2024-09-04T18:45:51.708+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 04 18:45:51 mypc ollama[2482468]: INFO [main] build info | build=1 commit="1e6f655" tid="131619309900608" timestamp=1725468351
Sep 04 18:45:51 mypc ollama[2482468]: INFO [main] system info | n_threads=16 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="131619309900608" timestamp=1725468351 total_threads=32
Sep 04 18:45:51 mypc ollama[2482468]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="36599" tid="131619309900608" timestamp=1725468351
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-f8f84c9d64218d440bcee215e1d298a64cb4fde44df2c0b4791482bb16152ebd (version GGUF V3 (latest))
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 70B Instruct
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv   5:                         general.size_label str              = 70B
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv   6:                            general.license str              = llama3.1
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv   9:                          llama.block_count u32              = 80
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  11:                     llama.embedding_length u32              = 8192
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 28672
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 64
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  17:                          general.file_type u32              = 17
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - kv  28:               general.quantization_version u32              = 2
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - type  f32:  162 tensors
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - type q5_K:  481 tensors
Sep 04 18:45:51 mypc ollama[136853]: llama_model_loader: - type q6_K:   81 tensors
Sep 04 18:45:51 mypc ollama[136853]: time=2024-09-04T18:45:51.959+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 18:45:51 mypc ollama[136853]: llm_load_vocab: special tokens cache size = 256
Sep 04 18:45:52 mypc ollama[136853]: llm_load_vocab: token to piece cache size = 0.7999 MB
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: arch             = llama
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: vocab type       = BPE
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_vocab          = 128256
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_merges         = 280147
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: vocab_only       = 0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_ctx_train      = 131072
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_embd           = 8192
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_layer          = 80
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_head           = 64
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_head_kv        = 8
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_rot            = 128
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_swa            = 0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_embd_head_k    = 128
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_embd_head_v    = 128
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_gqa            = 8
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_embd_k_gqa     = 1024
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_embd_v_gqa     = 1024
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_ff             = 28672
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_expert         = 0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_expert_used    = 0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: causal attn      = 1
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: pooling type     = 0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: rope type        = 0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: rope scaling     = linear
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: freq_base_train  = 500000.0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: freq_scale_train = 1
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: rope_finetuned   = unknown
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: ssm_d_conv       = 0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: ssm_d_inner      = 0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: ssm_d_state      = 0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: model type       = 70B
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: model ftype      = Q5_K - Medium
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: model params     = 70.55 B
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: model size       = 46.51 GiB (5.66 BPW)
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: general.name     = Meta Llama 3.1 70B Instruct
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: LF token         = 128 'Ä'
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
Sep 04 18:45:52 mypc ollama[136853]: llm_load_print_meta: max token length = 256
Sep 04 18:45:52 mypc ollama[136853]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 04 18:45:52 mypc ollama[136853]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 04 18:45:52 mypc ollama[136853]: ggml_cuda_init: found 6 ROCm devices:
Sep 04 18:45:52 mypc ollama[136853]:   Device 0: AMD Radeon (TM) Pro VII, compute capability 9.0, VMM: no
Sep 04 18:45:52 mypc ollama[136853]:   Device 1: AMD Radeon (TM) Pro VII, compute capability 9.0, VMM: no
Sep 04 18:45:52 mypc ollama[136853]:   Device 2: AMD Radeon (TM) Pro VII, compute capability 9.0, VMM: no
Sep 04 18:45:52 mypc ollama[136853]:   Device 3: AMD Radeon (TM) Pro VII, compute capability 9.0, VMM: no
Sep 04 18:45:52 mypc ollama[136853]:   Device 4: AMD Radeon (TM) Pro VII, compute capability 9.0, VMM: no
Sep 04 18:45:52 mypc ollama[136853]:   Device 5: AMD Radeon (TM) Pro VII, compute capability 9.0, VMM: no
Sep 04 18:45:52 mypc ollama[136853]: llm_load_tensors: ggml ctx size =    2.37 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors: offloading 80 repeating layers to GPU
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors: offloaded 81/81 layers to GPU
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm0 buffer size =  8193.82 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm1 buffer size =  8008.94 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm2 buffer size =  7978.13 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm3 buffer size =  7447.88 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm4 buffer size =  7417.07 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm5 buffer size =  7893.67 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:  ROCm_Host buffer size =   688.88 MiB
Sep 04 18:45:53 mypc ollama[136853]: time=2024-09-04T18:45:53.464+02:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: segmentation fault (core dumped)"
Sep 04 18:45:53 mypc ollama[136853]: [GIN] 2024/09/04 - 18:45:53 | 500 |  1.797167336s |       127.0.0.1 | POST     "/api/chat"
Sep 04 18:45:58 mypc ollama[136853]: time=2024-09-04T18:45:58.465+02:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.000882851 model=/usr/share/ollama/.ollama/models/blobs/sha256-f8f84c9d64218d440bcee215e1d298a64cb4fde44df2c0b4791482bb16152ebd
Sep 04 18:45:58 mypc ollama[136853]: time=2024-09-04T18:45:58.714+02:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.250372171 model=/usr/share/ollama/.ollama/models/blobs/sha256-f8f84c9d64218d440bcee215e1d298a64cb4fde44df2c0b4791482bb16152ebd
Sep 04 18:45:58 mypc ollama[136853]: time=2024-09-04T18:45:58.964+02:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.500648232 model=/usr/share/ollama/.ollama/models/blobs/sha256-f8f84c9d64218d440bcee215e1d298a64cb4fde44df2c0b4791482bb16152ebd
Author
Owner

@vanife commented on GitHub (Sep 4, 2024):

I also tried the "success" scenario with this result:
Command: OLLAMA_DEBUG=1 ollama run llama3.1:70b

rocm-smi result once the client started loading:

=============================================== ROCm System Management Interface ===============================================
========================================================= Concise Info =========================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK    MCLK     Fan     Perf    PwrCap       VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)
================================================================================================================================
0       1     0x66a1,   7068   50.0°C  20.0W     N/A, N/A, 0         860Mhz  350Mhz   34.51%  manual  144.0W       55%    0%
1       2     0x66a1,   20169  45.0°C  17.0W     N/A, N/A, 0         860Mhz  350Mhz   34.51%  manual  144.0W       58%    0%
2       3     0x66a1,   40303  41.0°C  22.0W     N/A, N/A, 0         860Mhz  350Mhz   34.51%  manual  144.0W       58%    0%
3       4     0x66a1,   63425  44.0°C  21.0W     N/A, N/A, 0         860Mhz  350Mhz   34.51%  manual  144.0W       58%    0%
4       5     0x66a1,   53400  39.0°C  20.0W     N/A, N/A, 0         860Mhz  350Mhz   34.51%  manual  144.0W       58%    0%
5       6     0x66a1,   11634  41.0°C  19.0W     N/A, N/A, 0         860Mhz  350Mhz   34.51%  manual  144.0W       55%    0%
================================================================================================================================
===================================================== End of ROCm SMI Log ======================================================

ollama ps:

NAME            ID              SIZE    PROCESSOR       UNTIL
llama3.1:70b    d729c66f84de    55 GB   100% GPU        4 minutes from now

and the result of the journalctl --since "15 minutes ago" -u ollama --no-pager is the same as the "failure" scenario until the llm_load_tensors block starts, so I will post only the remainder of the output from that point on:

failure (same as previous post) llama3.1:70b-instruct-q5_K_M:

Sep 04 18:45:52 mypc ollama[136853]: llm_load_tensors: ggml ctx size =    2.37 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors: offloading 80 repeating layers to GPU
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors: offloaded 81/81 layers to GPU
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm0 buffer size =  8193.82 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm1 buffer size =  8008.94 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm2 buffer size =  7978.13 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm3 buffer size =  7447.88 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm4 buffer size =  7417.07 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:      ROCm5 buffer size =  7893.67 MiB
Sep 04 18:45:53 mypc ollama[136853]: llm_load_tensors:  ROCm_Host buffer size =   688.88 MiB
Sep 04 18:45:53 mypc ollama[136853]: time=2024-09-04T18:45:53.464+02:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: segmentation fault (core dumped)"
Sep 04 18:45:53 mypc ollama[136853]: [GIN] 2024/09/04 - 18:45:53 | 500 |  1.797167336s |       127.0.0.1 | POST     "/api/chat"
Sep 04 18:45:58 mypc ollama[136853]: time=2024-09-04T18:45:58.465+02:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.000882851 model=/usr/share/ollama/.ollama/models/blobs/sha256-f8f84c9d64218d440bcee215e1d298a64cb4fde44df2c0b4791482bb16152ebd
Sep 04 18:45:58 mypc ollama[136853]: time=2024-09-04T18:45:58.714+02:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.250372171 model=/usr/share/ollama/.ollama/models/blobs/sha256-f8f84c9d64218d440bcee215e1d298a64cb4fde44df2c0b4791482bb16152ebd
Sep 04 18:45:58 mypc ollama[136853]: time=2024-09-04T18:45:58.964+02:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.500648232 model=/usr/share/ollama/.ollama/models/blobs/sha256-f8f84c9d64218d440bcee215e1d298a64cb4fde44df2c0b4791482bb16152ebd

success (llama3.1:70b):

Sep 04 19:22:51 mypc ollama[136853]: llm_load_tensors: ggml ctx size =    2.37 MiB
Sep 04 19:22:53 mypc ollama[136853]: time=2024-09-04T19:22:53.474+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 04 19:22:53 mypc ollama[136853]: time=2024-09-04T19:22:53.921+02:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 04 19:22:53 mypc ollama[136853]: llm_load_tensors: offloading 80 repeating layers to GPU
Sep 04 19:22:53 mypc ollama[136853]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 04 19:22:53 mypc ollama[136853]: llm_load_tensors: offloaded 81/81 layers to GPU
Sep 04 19:22:53 mypc ollama[136853]: llm_load_tensors:      ROCm0 buffer size =  6426.88 MiB
Sep 04 19:22:53 mypc ollama[136853]: llm_load_tensors:      ROCm1 buffer size =  6426.88 MiB
Sep 04 19:22:53 mypc ollama[136853]: llm_load_tensors:      ROCm2 buffer size =  6426.88 MiB
Sep 04 19:22:53 mypc ollama[136853]: llm_load_tensors:      ROCm3 buffer size =  5967.81 MiB
Sep 04 19:22:53 mypc ollama[136853]: llm_load_tensors:      ROCm4 buffer size =  5967.81 MiB
Sep 04 19:22:53 mypc ollama[136853]: llm_load_tensors:      ROCm5 buffer size =  6330.73 MiB
Sep 04 19:22:53 mypc ollama[136853]: llm_load_tensors:        CPU buffer size =   563.62 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model: n_ctx      = 8192
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model: n_batch    = 512
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model: n_ubatch   = 512
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model: flash_attn = 0
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model: freq_base  = 500000.0
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model: freq_scale = 1
Sep 04 19:23:58 mypc ollama[136853]: llama_kv_cache_init:      ROCm0 KV buffer size =   448.00 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_kv_cache_init:      ROCm1 KV buffer size =   448.00 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_kv_cache_init:      ROCm2 KV buffer size =   448.00 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_kv_cache_init:      ROCm3 KV buffer size =   416.00 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_kv_cache_init:      ROCm4 KV buffer size =   416.00 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_kv_cache_init:      ROCm5 KV buffer size =   384.00 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model: KV self size  = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model:  ROCm_Host  output buffer size =     2.08 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model:      ROCm0 compute buffer size =  1216.01 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model:      ROCm1 compute buffer size =  1216.01 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model:      ROCm2 compute buffer size =  1216.01 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model:      ROCm3 compute buffer size =  1216.01 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model:      ROCm4 compute buffer size =  1216.01 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model:      ROCm5 compute buffer size =  1216.02 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model:  ROCm_Host compute buffer size =    80.02 MiB
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model: graph nodes  = 2566
Sep 04 19:23:58 mypc ollama[136853]: llama_new_context_with_model: graph splits = 7
Sep 04 19:24:02 mypc ollama[2536770]: INFO [main] model loaded | tid="140583704699712" timestamp=1725470642
Sep 04 19:24:02 mypc ollama[136853]: time=2024-09-04T19:24:02.863+02:00 level=INFO source=server.go:630 msg="llama runner started in 72.35 seconds"

Any other ideas about what I could try?

Author
Owner

@dhiltgen commented on GitHub (Sep 5, 2024):

@vanife can you run the server with OLLAMA_DEBUG=1 set? (setting this in the client has no effect)

https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-configure-ollama-server
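
For reference, a minimal sketch of doing this on a systemd install (following the FAQ above) and capturing the server log afterwards; the file name is only illustrative:

```sh
# Set the variable on the service, not in the client shell:
sudo systemctl edit ollama.service
#   [Service]
#   Environment="OLLAMA_DEBUG=1"
sudo systemctl daemon-reload
sudo systemctl restart ollama
# Reproduce the crash, then save the server log:
journalctl -u ollama --no-pager > ollama-debug.log
```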

Author
Owner

@MikeLP commented on GitHub (Sep 7, 2024):

@dhiltgen I was experimenting with a manual build of llama.cpp, and the bug is definitely in llama.cpp.
Before this, I didn't have any issues running large models. So it looks like, starting from version 0.3.7, ollama uses a newer llama.cpp that has this bug.

Here is my build command

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIPBLAS=ON -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 48

I tried offloading all layers and then just one layer, but it still crashes.

Output:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 ROCm devices:
  Device 0: AMD Radeon VII, compute capability 9.0, VMM: no
  Device 1: AMD Radeon RX 6900 XT, compute capability 10.3, VMM: no
  Device 2: AMD Instinct MI100, compute capability 9.0, VMM: no
  Device 3: AMD Instinct MI100, compute capability 9.0, VMM: no
llm_load_tensors: ggml ctx size =    0.68 MiB
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/81 layers to GPU
llm_load_tensors:      ROCm0 buffer size =   518.88 MiB
llm_load_tensors:        CPU buffer size = 40543.11 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =   512.00 MiB
llama_kv_cache_init:  ROCm_Host KV buffer size = 40448.00 MiB
llama_new_context_with_model: KV self size  = 40960.00 MiB, K (f16): 20480.00 MiB, V (f16): 20480.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     0.98 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 17200.00 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate ROCm0 buffer of size 18035509504
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model '../../../Downloads/Hermes-3-Llama-3.1-70B.Q4_K_M.gguf'
 ERR [              load_model] unable to load model | tid="125643742393280" timestamp=1725676299 model="../../../Downloads/Hermes-3-Llama-3.1-70B.Q4_K_M.gguf"
 ERR [                    main] exiting due to model loading error | tid="125643742393280" timestamp=1725676299
[1]    2017050 segmentation fault (core dumped)  build/bin/llama-server -m  -ngl 1

P.S.
It fails even without offloading any layers to the GPU (-ngl 0).
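
For what it's worth, when testing llama-server by hand it can help to cap the context explicitly, since the default here ends up being the model's full 131072-token training context (a sketch; the path is taken from the run above and the values are only illustrative):

```sh
# Same server binary, but with an explicit 8K context and full offload:
build/bin/llama-server -m ../../../Downloads/Hermes-3-Llama-3.1-70B.Q4_K_M.gguf -c 8192 -ngl 81
```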

Author
Owner

@MikeLP commented on GitHub (Sep 7, 2024):

@dhiltgen I created issue https://github.com/ggerganov/llama.cpp/issues/9352
@vanife Could you please add a comment there noting that you have the same issue?

Author
Owner

@MikeLP commented on GitHub (Sep 8, 2024):

@dhiltgen Well, I think I understand the problem better now after talking with the llama.cpp folks. Before, I was fairly sure that llama.cpp manages the context size and offloads it to RAM by default when VRAM isn't enough, but it doesn't.

As I learned later, ollama normally sets a small context size by default (2048 or so), but in this case (after 0.3.6) it doesn't overwrite it. Instead, it leaves the model's large default context size in place (if the user doesn't override it) and tries to fit it into VRAM, which exhausts the available memory (leading to a segmentation fault).
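
To put numbers on it, a back-of-the-envelope sketch using the values from the llama.cpp output above (n_layer = 80, n_embd_k_gqa = n_embd_v_gqa = 1024, f16 KV cache at 2 bytes per element):

```sh
# KV cache bytes = n_layer * n_ctx * (n_embd_k_gqa + n_embd_v_gqa) * 2, shown here in MiB:
echo $(( 80 * 131072 * (1024 + 1024) * 2 / 1024 / 1024 ))   # 40960 MiB at the full 131072-token context
echo $(( 80 *   8192 * (1024 + 1024) * 2 / 1024 / 1024 ))   # 2560 MiB at the 8192-token context from the successful run
```

which matches the "KV self size = 40960.00 MiB" and "KV self size = 2560.00 MiB" lines in the logs above, so leaving the training context in place adds roughly 40 GiB of VRAM pressure on top of the weights.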

Honestly, as a developer, I don't think any good product (especially a production one) should crash with "segmentation fault" errors. But this is open source, so whatever. I don't believe llama.cpp will fix this issue, so I just closed the GitHub ticket.

It would be great if ollama, as a high-level API, could handle this error and automatically calculate the maximum context size that fits in the user's available memory (since we know the model size, quantization, and number of layers). However, I understand that this is not a one-day fix but rather a significant feature.

So maybe the only thing we can look into now is why ollama isn't overriding the default context size in some cases in version 0.3.9.

Or, if it's expected behaviour, it just needs to be mentioned in the documentation.
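
In the meantime, the context can be capped explicitly from the client side; a sketch (the model name and value are just examples):

```sh
# Interactively, inside `ollama run <model>`:
#   >>> /set parameter num_ctx 8192
# Or per request through the API:
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.1:70b-instruct-q5_K_M",
  "prompt": "hello",
  "options": { "num_ctx": 8192 }
}'
```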

Author
Owner

@vanife commented on GitHub (Sep 8, 2024):

@vanife can you run the server with OLLAMA_DEBUG=1 set? (setting this in the client has no effect)

https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-configure-ollama-server

OK. I have made the following changes to the system:

  • updated amdgpu software, installed ROCm v 6.2.0
  • reinstalled ollama (including scripts needed for AMD GPUs)
  • removed all NVIDIA GPUs, disabled the iGPU, and installed even more Radeon Pro VII cards for a total of 10 right now (some using "risers").
  • set OLLAMA_DEBUG=1 for the service.

results of rocm-smi

=========================================== ROCm System Management Interface ===========================================
===================================================== Concise Info =====================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK    MCLK    Fan    Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)
========================================================================================================================
0       1     0x66a1,   52718  41.0°C  17.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
1       2     0x66a1,   27240  35.0°C  15.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
2       3     0x66a1,   22570  31.0°C  15.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
3       4     0x66a1,   24674  35.0°C  22.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
4       5     0x66a1,   13217  32.0°C  19.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
5       6     0x66a1,   49889  30.0°C  21.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
6       7     0x66a1,   62886  33.0°C  21.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
7       8     0x66a1,   1254   33.0°C  21.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
8       9     0x66a1,   34102  32.0°C  20.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
9       10    0x66a1,   46964  33.0°C  21.0W     N/A, N/A, 0         860Mhz  350Mhz  9.41%  auto  190.0W  0%     0%
========================================================================================================================
================================================= End of ROCm SMI Log ==================================================

Now different models fail (compared to the previous tests) and I cannot figure out why, because VRAM is clearly not the issue; RAM shouldn't be either (in case it is needed to load the models), since even smaller models fail to load.
I have 64 GB RAM and 160 GB VRAM (10x 16 GB).

The full log from trying to run llama3.1:70b-instruct-q5_K_M (it fails now as before) is below. On line 577 the segmentation fault error is generated, but I cannot see anything in the extra DEBUG output that would indicate the reason:
_llama3.1_70b-instruct-q5_K_M--error.log

However, this time llama3.1:70b also results in the same error, although it was loading fine a few days ago.

Author
Owner

@dhiltgen commented on GitHub (Sep 12, 2024):

Somehow our prediction logic isn't correctly processing the context size across the GPUs, so we're loading too many layers on some GPU (probably GPU0). @MikeLP, until we can find and fix the defect, there are a couple of workarounds that may help.
You can force OLLAMA_NUM_PARALLEL=1, which will reduce the context by 4x. You may also be able to leverage the new env var OLLAMA_GPU_OVERHEAD to preserve some additional space on the GPUs.
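
A sketch of setting these on the server side (the same systemd-override mechanism as in the FAQ linked earlier; treating the OLLAMA_GPU_OVERHEAD value as bytes is an assumption here, and the amount is only illustrative):

```sh
sudo systemctl edit ollama.service
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=1"
#   Environment="OLLAMA_GPU_OVERHEAD=1073741824"   # assumed bytes: reserve ~1 GiB per GPU
sudo systemctl daemon-reload && sudo systemctl restart ollama
```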

Author
Owner

@MikeLP commented on GitHub (Sep 15, 2024):

@dhiltgen Unfortunately OLLAMA_GPU_OVERHEAD doesn't work in my case.

Author
Owner

@vanife commented on GitHub (Sep 26, 2024):

Somehow our prediction logic isn't correctly processing the context size across the GPUs, so we're loading too many layers on some GPU (probably GPU0). @MikeLP until we can find and fix the defect, there are a couple workarounds that may help. You can force OLLAMA_NUM_PARALLEL=1 which will reduce the context by 4x. You may be able to leverage the new env var OLLAMA_GPU_OVERHEAD to preserve some additional space on the GPUs.

I have tried many models and many sizes and came to the following observation: 62-64 GB seems to be the breaking point. Here is what I mean: as soon as the model size is 62 GB or more, the model fails to run with the error in the subject (Error: llama runner process has terminated: signal: segmentation fault (core dumped)).
Also, by "model size" I mean what ollama ps shows in the SIZE column, not the SIZE from ollama ls.

I tried OLLAMA_NUM_PARALLEL=1 and llama3.1:70b-instruct-q3_K_L loaded (ls size is "37 GB", ps size is "60 GB"), whereas without that parameter it failed, as its ollama ps size was "63 GB".

This gave me the hint to try something else: I have 11 identical GPUs (Radeon Pro VII with 16 GB VRAM each).

  • When I run llama3.1:8b-instruct-fp16 (ls: 16 GB, ps: 34 GB), it works and spreads over all 11 GPUs at ~14% load each.
  • When I run llama3.1:70b-instruct-q3_K_M (ls: 34 GB, ps: 60 GB), it also works, spreading work over 11 GPUs at ~30% load.
  • When I run llama3.1:70b-instruct-q3_K_L (ls: 37 GB, ps: 63 GB), it failed, but ollama ps briefly showed 63 GB (100% GPU).
  • so here I tried OLLAMA_NUM_PARALLEL=1, and it allowed it to run
  • I then used HIP_VISIBLE_DEVICES to limit the number of GPUs, and surprisingly (to me) the ollama ps number was smaller for a smaller number of devices: 53 GB for 7 GPUs, 50 GB for 5 GPUs, and 45 GB for 3 GPUs, all of which loaded and ran without any problem (a sketch of this device-limiting setup follows the list).
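
A sketch of the device-limiting setup referenced above, assuming the server is started by hand for the test (Ollama honours HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES; the indices below are only illustrative):

```sh
# Start the server with only 3 of the GPUs visible:
HIP_VISIBLE_DEVICES=0,1,2 OLLAMA_DEBUG=1 ollama serve
# From another shell, load the model and compare the reported size:
ollama run llama3.1:70b-instruct-q3_K_L
ollama ps   # SIZE shrinks as fewer devices are visible
```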

I then repeated the same "test" on the qwen2.5:72b-instruct-q3_K_M model, where it worked with 3 to 9 GPUs (61 GB for 9 GPUs), but it failed for 10 or 11, for which ollama ps was showing 63 and 64 GB.

Therefore, I have a strong suspicion that something happens around ~64 GB.

Any ideas?

Which two very similar settings should I run and provide with the log?

Author
Owner

@MikeLP commented on GitHub (Sep 26, 2024):

@dhiltgen This info should help.


@vanife commented on GitHub (Sep 27, 2024):

> Any ideas?
> Which two very similar settings should I run and provide with the log?

Please find attached ollama server log output:
[ollama-6595-cast2.log](https://github.com/user-attachments/files/17160864/ollama-6595-cast2.log)
and also a corresponding tmux session .gif file:
[ollama-6595-cast2-lossy](https://github.com/user-attachments/assets/e9179854-c847-4456-96c3-6821c31f082a)


@vanife commented on GitHub (Sep 27, 2024):

... happy to have a Zoom call if I can be of further help ...


@MikeLP commented on GitHub (Oct 21, 2024):

@dhiltgen Latest update 0.3.14 fixed the crash I had for large models. Should I close the issue?
@vanife Does it work in your case?


@vanife commented on GitHub (Oct 21, 2024):

> @dhiltgen Latest update 0.3.14 fixed the crash I had for large models. Should I close the issue? @vanife Does it work in your case?

Give me a day to check.
@MikeLP, do you know how exactly it was fixed? What was the culprit?


@MikeLP commented on GitHub (Oct 21, 2024):

@vanife
The changelog says Fix crashes for AMD GPUs with small system memory, which doesn't make sense for my system - 128 GB ECC DDR5 RAM and 96 GB VRAM are not small system memory - but after the update, when I ran llama3.1:405b-instruct-q2_K, it finally worked again.


@vanife commented on GitHub (Oct 21, 2024):

I can confirm that I am not getting this error anymore and the model loads, including the llama3.1:405b-instruct-q2_K you ran.
However, all large (I think all multi-GPU) models generate total rubbish.
In any event, this is another issue, so it makes sense to close this one.


@farshadghodsian commented on GitHub (Oct 21, 2024):

> @dhiltgen Latest update 0.3.14 fixed the crash I had for large models. Should I close the issue? @vanife Does it work in your case?

I can also confirm that the latest update fixes a similar issue I had with a segmentation fault when trying to load large models (Llama3.1-405B-q4). After updating to 0.3.14 today I can now successfully load and run Llama3.1-405b on 5x W7900 GPUs. It runs a bit slow, but I'm happy that it finally works, as I've been troubleshooting this issue for the last few days.

Error seen on Ollama 0.3.13:
![Screenshot 2024-10-21 152440](https://github.com/user-attachments/assets/8e99bc8c-c44c-4f00-bf8a-1d03c91ac780)

Fixed after updating to Ollama 0.3.14 with no other change:
![Screenshot 2024-10-21 153220](https://github.com/user-attachments/assets/2132b432-82f8-42ef-b2b9-b48074a27880)


@MikeLP commented on GitHub (Oct 22, 2024):

> ...However, all large (I think all multi-GPU) models generate total rubbish.

Interesting - in my case the output is good.


@farshadghodsian commented on GitHub (Oct 22, 2024):

> I can confirm that I am not getting this error anymore and the model loads, including the llama3.1:405b-instruct-q2_K you ran. However, all large (I think all multi-GPU) models generate total rubbish. In any event, this is another issue, so it makes sense to close this one.

If you are on Linux and having issues with Ollama generating gibberish in a multi-GPU setup with AMD GPUs, it could be that you need to enable IOMMU passthrough in your GRUB config (see https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/install-faq.html#issue-5-application-hangs-on-multi-gpu-systems). I've experienced it with earlier versions of ROCm, and doing that fixed all my multi-GPU issues.
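For reference, a hedged sketch of how that kernel parameter is typically applied on a GRUB-based distro (the exact file and regeneration command vary by distribution; back up the config first):

```
# Edit /etc/default/grub and append iommu=pt to the kernel command line, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=pt"
sudo nano /etc/default/grub

# Regenerate the GRUB config and reboot (on Fedora/RHEL use grub2-mkconfig instead).
sudo update-grub && sudo reboot

# After the reboot, confirm the kernel picked up the parameter:
grep -o 'iommu=pt' /proc/cmdline
```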


@vanife commented on GitHub (Oct 22, 2024):

> > I can confirm that I am not getting this error anymore and the model loads, including the llama3.1:405b-instruct-q2_K you ran. However, all large (I think all multi-GPU) models generate total rubbish. In any event, this is another issue, so it makes sense to close this one.
>
> If you are on Linux and having issues with Ollama generating gibberish in a multi-GPU setup with AMD GPUs, it could be that you need to enable IOMMU passthrough in your GRUB config (see https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/install-faq.html#issue-5-application-hangs-on-multi-gpu-systems). I've experienced it with earlier versions of ROCm, and doing that fixed all my multi-GPU issues.

Thank you for the hint. I tried it, but it did not help at all.
Previously I had an issue when the model crossed ~64 GB in size (see my comments above), but now I have tried various models and configurations, and as soon as a model does not fit in 1 GPU it produces total garbage, whereas before it worked for models that fit into a few GPUs.

I have 11x AMD Radeon Pro VII with 16 GB VRAM each. I wonder if support for them has somehow been removed now, as they are quite dated...

But again - no segmentation fault error anymore, so this must be a different issue now, so this could be closed from my perspective.


@dhiltgen commented on GitHub (Oct 22, 2024):

@vanife if you force the system to under-allocate layers by setting OLLAMA_GPU_OVERHEAD does that yield good output across your multiple GPUs, or is the gibberish behavior consistent even with under-allocation?

How much system memory do you have? From what I understand, AMD recommends your system memory be larger than VRAM for reliable ROCm behavior.
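A minimal sketch of the under-allocation experiment being suggested; to my understanding OLLAMA_GPU_OVERHEAD is a per-GPU reservation expressed in bytes, and the 2 GiB value here is only an illustrative starting point:

```
# Stop the systemd-managed server so the variable reaches the server process.
sudo systemctl stop ollama

# Reserve ~2 GiB of headroom on every GPU so fewer layers are offloaded.
OLLAMA_GPU_OVERHEAD=2147483648 ollama serve
```

If the gibberish persists even with clearly under-allocated GPUs, a layer-prediction overshoot is unlikely to be the cause.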


@vanife commented on GitHub (Oct 22, 2024):

> @vanife if you force the system to under-allocate layers by setting OLLAMA_GPU_OVERHEAD does that yield good output across your multiple GPUs, or is the gibberish behavior consistent even with under-allocation?
>
> How much system memory do you have? From what I understand, AMD recommends your system memory be larger than VRAM for reliable ROCm behavior.

I have just 64 GB RAM (and this seems to have been my suspicion in my prior comments, where if the total VRAM need was around this number, it failed to load with a "segmentation fault" error). Search for the text "Therefore, I have a strong suspicion that something happens around ~64 GB" to find my comment above on this topic.

Now (using version 0.3.14) I do not get a "segmentation fault" anymore, but all models, even 20-30 GB ones, produce complete garbage as soon as the model does not fit into 1 GPU.

I do not think that setting OLLAMA_GPU_OVERHEAD will change anything, as loading llama3.1:70b-instruct-q5_K_M spreads nicely across GPUs (25-30% VRAM usage across 11, ~60% across 5 and ~73% across 4 GPUs), leaving plenty of free space.


@vanife commented on GitHub (Oct 22, 2024):

OK, so the hypothesis that the problem is strictly related to multi-GPU is not correct.

On my PC I have 11 AMD Radeon Pro VII cards (and 2 more NVIDIA cards, which are disabled for LLM usage), but I use PCIe riser/splitters to connect them, as my mobo does not have that many PCIe slots.

I have now enabled only 1 card per splitter using HIP_VISIBLE_DEVICES, and Ollama loads without any problem on 3-4 GPUs as long as I only use one GPU per riser/splitter.
In this setup I can only use 4 GPUs, as I have only 4 PCIe slots directly on the mobo, which limits total usable GPU VRAM to 64 GB (4x16), which coincidentally is also my RAM limit. So my prior issue that resulted in a segmentation fault could also be related to this setup.

I do not yet understand why this is the case, but it works for llama3.1:70b-instruct-q5_K_M without any problem.
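A sketch of that kind of device filtering, assuming for illustration that the directly attached cards enumerate as ROCm devices 0, 2, 5 and 8 (check the actual numbering on your box; the rocm-smi flag name may differ between releases):

```
# Map ROCm device indices to PCI bus IDs to work out which card sits on which riser.
rocm-smi --showbus

# Expose only one Radeon Pro VII per physical PCIe slot to the Ollama runner.
sudo systemctl stop ollama
HIP_VISIBLE_DEVICES=0,2,5,8 ollama serve
```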

Question: any idea why this might be the case? Below is an extract from lspci output:

```
--00.0-[03]--+-00.0 [Radeon Pro VII]

+-00.0-[07-13]----00.0-[08-13]--+-02.0-[0a-0c]----00.0-[0b-0c]----00.0-[0c]--+-00.0 [Radeon Pro VII]
|                               +-03.0-[0d-0f]----00.0-[0e-0f]----00.0-[0f]--+-00.0 [Radeon Pro VII]
|                               +-06.0-[12]--+-00.0  NVIDIA Corporation AD104 [GeForce RTX 4070 SUPER]
|                               \-07.0-[13]--+-00.0  NVIDIA Corporation AD104 [GeForce RTX 4070 SUPER]

+-08.0-[14-21]----00.0-[15-21]--+-01.0-[16-18]----00.0-[17-18]----00.0-[18]--+-00.0 [Radeon Pro VII]
|                               +-03.0-[19-1b]----00.0-[1a-1b]----00.0-[1b]--+-00.0 [Radeon Pro VII]
|                               +-05.0-[1c-1e]----00.0-[1d-1e]----00.0-[1e]--+-00.0 [Radeon Pro VII]
|                               \-07.0-[1f-21]----00.0-[20-21]----00.0-[21]--+-00.0 [Radeon Pro VII]

+-09.0-[22-32]----00.0-[23-32]--+-02.0-[25-27]----00.0-[26-27]----00.0-[27]--+-00.0 [Radeon Pro VII]
|                               +-03.0-[28-2a]----00.0-[29-2a]----00.0-[2a]--+-00.0 [Radeon Pro VII]
|                               +-06.0-[2d-2f]----00.0-[2e-2f]----00.0-[2f]--+-00.0 [Radeon Pro VII]
|                               \-07.0-[30-32]----00.0-[31-32]----00.0-[32]--+-00.0 [Radeon Pro VII]
```

@dhiltgen commented on GitHub (Oct 22, 2024):

The change we made in 0.3.14 that resolved the problem for most everyone was to stop enabling a workaround in llama.cpp related to GGML_CUDA_NO_PEER_COPY - origin: https://github.com/ggerganov/llama.cpp/commit/2f0e81e053b41ca28e73a841e7bdbf9820baaa57

This was introduced to work around problems with multiple GPUs doing P2P copies between them.

I believe setting NCCL_P2P_DISABLE=1 on the server may accomplish the same thing at runtime. @vanife can you try that and see if it helps?
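A minimal sketch of setting that variable persistently for the systemd-managed server; whether the ROCm backend actually honors NCCL_P2P_DISABLE at runtime is exactly what this would test:

```
# Create or open a drop-in override for the Ollama unit.
sudo systemctl edit ollama.service
# In the editor that opens, add:
#   [Service]
#   Environment="NCCL_P2P_DISABLE=1"

# Reload the unit definitions and restart the server so the variable is picked up.
sudo systemctl daemon-reload
sudo systemctl restart ollama
```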


@vanife commented on GitHub (Oct 23, 2024):

> The change we made in 0.3.14 that resolved the problem for most everyone was to stop enabling a workaround in llama.cpp related to GGML_CUDA_NO_PEER_COPY - origin: ggerganov/llama.cpp@2f0e81e
>
> This was introduced to work around problems with multiple GPUs doing P2P copies between them.
>
> I believe setting NCCL_P2P_DISABLE=1 on the server may accomplish the same thing at runtime. @vanife can you try that and see if it helps?

Thank you for the additional ideas, but I do not see how setting this flag changes anything; I do not see it being recognized by the server at runtime. I tried both NCCL_P2P_DISABLE=1 and GGML_CUDA_NO_PEER_COPY=1 in the ollama.service config, but the logging does not show either of them being used, and there is no change from my previous results.

I am pretty sure it has something to do with my PCIe splitters; here is why:

  • I have 11 GPUs on a "consumer" mobo, so I use splitters (like this one: https://www.aliexpress.com/item/1005006779220914.html)
  • When I do not disable any GPU, the model loads, but produces garbage.
  • When I disable GPUs so that only 4 are available (0,1,2,3 of 11, using various combinations), the result is the same - garbage.
  • When I enable only 1 GPU per original PCIe slot on the board (1 connected directly to the mobo + 3 via splitters, but only 1 per splitter/mobo PCIe slot), the model loads and works.

Hence I conclude that the issue is somehow related to the fact that I use PCIe splitters (which are made for mining, which I do not do, but I do other calculations that work well in such a setup). But it would be great to understand the actual reason for this.


Again, I think this issue should be closed, as the "segmentation fault" is no longer generated since 0.3.14.


@saman-amd commented on GitHub (Oct 23, 2024):

Hey @vanife,
GGML_CUDA_NO_PEER_COPY=1 is a build flag which needs to be used at compile time, so it will not make any changes at runtime. Ollama 0.3.13 had this flag set; the segmentation fault would have been caused if RAM < Model_Size < VRAM.
So since you have 64 GB RAM, I guess if you use version 0.3.13 with a model < 64 GB you shouldn't experience the seg fault. Maybe you could rule that scenario out by checking whether you see the same garbage output with 0.3.13 and a model size < 64 GB.
I was wondering if you've ever tried running this on the 2x NVIDIA GPUs that you have on your splitter? Does the same thing happen?
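For context on what "build flag" means here, a hedged sketch of how such an option would typically be passed when compiling llama.cpp standalone for ROCm; the option names (GGML_HIPBLAS, GGML_CUDA_NO_PEER_COPY) are assumptions based on the llama.cpp CMake of roughly that era, not Ollama's own build scripts, so verify they exist in your checkout:

```
# Hypothetical standalone llama.cpp build re-enabling the old peer-copy workaround.
cmake -B build -DGGML_HIPBLAS=ON -DGGML_CUDA_NO_PEER_COPY=ON
cmake --build build --config Release -j"$(nproc)"
```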


@dhiltgen commented on GitHub (Oct 23, 2024):

@vanife agreed, we can close this one now. I'm just hoping we can find the root cause and a viable solution for your setup as well, and document it for others if it turns out to be env flags that do the trick.

Ollama doesn't implement this P2P env var, but I believe one of the underlying libraries we use does. What I've been told is that direct GPU <--> GPU copy works only under certain conditions, like both GPUs:

  • should be under the same PCI root port
  • Large BAR enabled
  • IOMMU disabled

So our hope is that by setting this env var at runtime, you'll accomplish roughly the same behavior as the build-time setting we no longer set, and disable P2P copy, which in your setup may not be possible.
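A sketch of how those three conditions can be checked from a shell; output formats vary by kernel and board, and the grep patterns below are only rough filters:

```
# 1. Same PCI root port: show the PCIe tree and check whether the two GPUs hang
#    off the same root port / upstream switch.
lspci -tv | less

# 2. Large (resizable) BAR: inspect the BAR sizes reported for each display device.
sudo lspci -vvv -d ::0300 | grep -A3 'Region 0'

# 3. IOMMU state: check the kernel command line and boot messages.
cat /proc/cmdline
sudo dmesg | grep -iE 'iommu|AMD-Vi'
```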


@vanife commented on GitHub (Oct 24, 2024):

> I was wondering if you've ever tried running this on the 2x NVIDIA GPUs that you have on your splitter? Does the same thing happen?

Short answer: NO.

I ran a few more tests and can confirm the following: when I enable only the 2 NVIDIA "GeForce RTX 4070 SUPER" (12 GB VRAM) GPUs (which also sit on one PCIe splitter, shared with 2 more "Radeon Pro VII" GPUs), Ollama models load on both and work properly - no error and no garbage output. I have tested the following models [name, ls size, ps size, split]:

| Model                        | Size (ls) | Size (ps) | Comment                  |
|------------------------------|-----------|-----------|--------------------------|
| llama3.1:70b-instruct-q5_K_M | 49 GB     | 55 GB     | 56%/44% CPU/GPU: 2x4070S |
| qwen2:7b-instruct-fp16       | 15 GB     | 17 GB     | 100% GPU                 |
| llama3.1:8b-instruct-fp16    | 14 GB     | 18 GB     | 100% GPU                 |

So it appears to be limited to AMD cards. I am not sure why this is the case, and would love to find out why and how it can be solved. I would buy a more robust system (mobo + CPU + RAM) and offload some of my Pro VIIs there, but first I would like to understand whether these GPUs even make sense to use for local LLMs. Happy to explore any other ideas.


@vanife commented on GitHub (Oct 24, 2024):

> @vanife agreed, we can close this one now. I'm just hoping we can find the root cause and a viable solution for your setup as well, and document it for others if it turns out to be env flags that do the trick.
>
> Ollama doesn't implement this P2P env var, but I believe one of the underlying libraries we use does. What I've been told is that direct GPU <--> GPU copy works only under certain conditions, like both GPUs:
>
>   • should be under the same PCI root port
>   • Large BAR enabled
>   • IOMMU disabled
>
> So our hope is that by setting this env var at runtime, you'll accomplish roughly the same behavior as the build-time setting we no longer set, and disable P2P copy, which in your setup may not be possible.

  • What is the "same PCI root port"? How do I make sure this is the case?
  • Large BAR: this is set in the BIOS.
  • IOMMU disabled: I set iommu=pt as per the previous suggestion, but no change (I also tried disabling it completely in the BIOS - also no change).
  • I did set Environment="NCCL_P2P_DISABLE=1" in the Ollama server config, but no change.

@vanife commented on GitHub (Oct 24, 2024):

I am wondering if it could be related to the fact that the Radeon Pro VII is not actively supported, and some now-required features could be the reason for the failure. Extract from https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html:

[compatibility matrix screenshot](https://github.com/user-attachments/assets/c827a1ac-8b79-4bcb-a944-f44903a5f2bd)

@vanife commented on GitHub (Oct 24, 2024):

> Hence I conclude that the issue is somehow related to the fact that I use PCIe splitters (which are made for mining, which I do not do, but I do other calculations that work well in such a setup). But it would be great to understand the actual reason for this.

And just to validate this hypothesis further, I also added an M.2-to-GPU riser, which allowed me to connect one more GPU "directly" to the mobo so that it is not sharing anything via a splitter (or rather, it is another splitter for just one GPU).
Outcome? => Now I can run larger models on 5 (previously 4) GPUs, at a maximum of 1 per PCIe/M.2 slot. But as soon as I use even 2 cards on the same PCIe splitter => I get rubbish as the result.

Hence the PCIe splitters are the cause. I have different brands/types and I tried them all - same result.


It would still be nice to know the "real" reason, not just the situational one.


@MikeLP commented on GitHub (Oct 26, 2024):

@vanife I don't know if it will help you, but there are some limitations explained here:
https://rocm.docs.amd.com/projects/radeon/en/docs-6.1.3/docs/install/native_linux/mgpu.html

> Hardware
> ✓ - GPU0 PCIe x16 connection + GPU1 PCIe x16 connection
> ✓ - GPU0 PCIe x8 connection + GPU1 PCIe x8 connection
> X - GPU0 PCIe x16 connection + GPU1 PCIe x8 connection

And

> Important!
> Only use PCIe slots connected by the CPU and avoid PCIe slots connected via the chipset. Refer to product-specific motherboard documentation for PCIe electrical configuration.
> Ensure the system Power Supply Unit (PSU) has sufficient wattage to support multiple GPUs.


@MikeLP commented on GitHub (Nov 8, 2024):

@dhiltgen Sorry to disturb, but I've encountered a new error specific to version 0.4.0 - 'llama runner process has terminated: exit status 127'.

I'm seeing this issue with all model sizes regardless of VRAM size, and the logs indicate that VRAM can't recover. Notably, version 0.3.14 works perfectly fine, so it doesn't seem to be a hardware issue. Should I open a new issue since it's related to ROCm, or is this already a known issue like this one (https://github.com/ollama/ollama/issues/7542)?


@vanife commented on GitHub (Nov 9, 2024):

I can also confirm a new issue where 0.3.14 works (4x AMD GPU setup on Linux), but it fails on both 0.4.0 and 0.4.1 with "Error: llama runner process has terminated: error loading model: unable to allocate backend buffer".


@dhiltgen commented on GitHub (Nov 13, 2024):

@MikeLP @vanife please check your server logs to see what caused the runner to crash. If you don't see any other recent issues that have the same failure, please file a new issue.


@dhiltgen commented on GitHub (Feb 25, 2025):

Is this still a problem with the latest versions? I'm trying to determine if #7378 is still useful.


@vanife commented on GitHub (Feb 26, 2025):

> Is this still a problem with the latest versions? I'm trying to determine if #7378 is still useful.

I cannot say whether #7378 is useful, but I have not experienced crashes for some time now, with or without it being set. Thank you.

Reference: github-starred/ollama#66190