[GH-ISSUE #6756] Yet another "segmentation fault" issue with AMD GPU #50770

Closed
opened 2026-04-28 17:04:57 -05:00 by GiteaMirror · 57 comments
Owner

Originally created by @ross-rosario on GitHub (Sep 11, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6756

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Error: llama runner process has terminated: signal: segmentation fault (core dumped). It occurs while loading larger models that are still within the VRAM capacity. Here I'm trying to load command-r:35b-08-2024-q4_K_M (19 GB) on an RX 7900 XTX with 24 GB of VRAM. Smaller models load fine.

Edit: even with gemma2:27b-instruct-q4_K_M (16 GB) I still get the error. It seems that the maximum model size that can be loaded is 13 GB, e.g. codestral:22b-v0.1-q4_K_M.

From the logs, ollama clearly says that the available VRAM is 23.5 GiB:
Sep 11 09:41:57 computer ollama[71334]: time=2024-09-11T09:41:57.987-06:00 level=INFO source=types.go:107 msg="inference compute" id=0 library=rocm variant="" compute=gfx1100 driver=0.0 name=1002:744c total="24.0 GiB" available="23.5 GiB"

Error:
Sep 11 09:43:18 computer ollama[71334]: time=2024-09-11T09:43:18.408-06:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 11 09:43:26 computer ollama[71334]: time=2024-09-11T09:43:26.324-06:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: seg

I could've sworn that models of that size used to load just fine on older ollama versions, but unfortunately I'm not sure which was the latest ollama version that worked.

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.3.10

GiteaMirror added the bug, amd, linux labels 2026-04-28 17:05:11 -05:00
Author
Owner

@igorschlum commented on GitHub (Sep 11, 2024):

Hi @remon-nashid, can you retry using ollama 0.3.10?

Author
Owner

@ross-rosario commented on GitHub (Sep 11, 2024):

@igorschlum will do as soon as it lands in the Arch Linux packages.

Author
Owner

@dhiltgen commented on GitHub (Sep 12, 2024):

I don't have an identical setup, but I tried to repro on Windows on a 7900 XTX with 0.3.9 and 0.3.10, and they both load this model OK. @remon-nashid are you specifying any custom parameters like context size? Server logs might help narrow things down as well.

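For anyone else gathering logs for this, here is a minimal sketch, assuming a systemd-managed install with the service named ollama (which the journald-style lines in this thread suggest). OLLAMA_DEBUG is the variable that appears later in the thread's server config dump; this is not the exact procedure any participant used.

# Follow the ollama service logs while reproducing the crash
journalctl -u ollama -f

# Optionally enable debug logging for the service, then restart it
sudo systemctl edit ollama   # add under [Service]: Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama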
Author
Owner

@ross-rosario commented on GitHub (Sep 12, 2024):

Hi @dhiltgen, thanks for looking into this. I'm not specifying a custom context size or any other parameters.

Below are the logs from running ollama run command-r:35b-08-2024-q4_K_M.

Sep 11 18:18:59 computer ollama[67030]: time=2024-09-11T18:18:59.630-06:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1037403190/runners
Sep 11 18:19:03 computer ollama[67030]: time=2024-09-11T18:19:03.676-06:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu rocm]"
Sep 11 18:19:03 computer ollama[67030]: time=2024-09-11T18:19:03.676-06:00 level=INFO source=gpu.go:200 msg="looking for compatible GPUs"
Sep 11 18:19:03 computer ollama[67030]: time=2024-09-11T18:19:03.676-06:00 level=WARN source=gpu.go:669 msg="unable to locate gpu dependency libraries"
Sep 11 18:19:03 computer ollama[67030]: time=2024-09-11T18:19:03.676-06:00 level=WARN source=gpu.go:669 msg="unable to locate gpu dependency libraries"
Sep 11 18:19:03 computer ollama[67030]: time=2024-09-11T18:19:03.700-06:00 level=WARN source=gpu.go:669 msg="unable to locate gpu dependency libraries"
Sep 11 18:19:03 computer ollama[67030]: time=2024-09-11T18:19:03.707-06:00 level=WARN source=amd_linux.go:59 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
Sep 11 18:19:03 computer ollama[67030]: time=2024-09-11T18:19:03.708-06:00 level=INFO source=amd_linux.go:348 msg="skipping rocm gfx compatibility check" HSA_OVERRIDE_GFX_VERSION=11.0.0
Sep 11 18:19:03 computer ollama[67030]: time=2024-09-11T18:19:03.709-06:00 level=INFO source=amd_linux.go:274 msg="unsupported Radeon iGPU detected skipping" id=1 total="512.0 MiB"
Sep 11 18:19:03 computer ollama[67030]: time=2024-09-11T18:19:03.709-06:00 level=INFO source=types.go:107 msg="inference compute" id=0 library=rocm variant="" compute=gfx1100 driver=0.0 name=1002:744c total="24.0 GiB" available="22.7 GiB"
Sep 11 18:19:31 computer ollama[67030]: [GIN] 2024/09/11 - 18:19:31 | 200 |      50.265µs |       127.0.0.1 | HEAD     "/"
Sep 11 18:19:31 computer ollama[67030]: [GIN] 2024/09/11 - 18:19:31 | 200 |    63.60154ms |       127.0.0.1 | POST     "/api/show"
Sep 11 18:19:31 computer ollama[67030]: time=2024-09-11T18:19:31.435-06:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/var/lib/ollama/blobs/sha256-c264671c5e5347f66317f2d92db914e46dd242de0fe50f5a534991a0a15a2656 gpu=0 parallel=4 available=24412168192 required="21.7 GiB"
Sep 11 18:19:31 computer ollama[67030]: time=2024-09-11T18:19:31.436-06:00 level=INFO source=memory.go:309 msg="offload to rocm" layers.requested=-1 layers.model=41 layers.offload=41 layers.split="" memory.available="[22.7 GiB]" memory.required.full="21.7 GiB" memory.required.partial="21.7 GiB" memory.required.kv="1.2 GiB" memory.required.allocations="[21.7 GiB]" memory.weights.total="18.1 GiB" memory.weights.repeating="16.5 GiB" memory.weights.nonrepeating="1.6 GiB" memory.graph.full="1.1 GiB" memory.graph.partial="2.1 GiB"
Sep 11 18:19:31 computer ollama[67030]: time=2024-09-11T18:19:31.439-06:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama1037403190/runners/rocm/ollama_llama_server --model /var/lib/ollama/blobs/sha256-c264671c5e5347f66317f2d92db914e46dd242de0fe50f5a534991a0a15a2656 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --no-mmap --parallel 4 --port 33599"
Sep 11 18:19:31 computer ollama[67030]: time=2024-09-11T18:19:31.440-06:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Sep 11 18:19:31 computer ollama[67030]: time=2024-09-11T18:19:31.440-06:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 11 18:19:31 computer ollama[67030]: time=2024-09-11T18:19:31.440-06:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 11 18:19:31 computer ollama[67338]: INFO [main] build info | build=3535 commit="1e6f6554a" tid="127366016851008" timestamp=1726100371
Sep 11 18:19:31 computer ollama[67338]: INFO [main] system info | n_threads=16 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="127366016851008" timestamp=1726100371 total_threads=32
Sep 11 18:19:31 computer ollama[67338]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="33599" tid="127366016851008" timestamp=1726100371
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: loaded meta data with 34 key-value pairs and 322 tensors from /var/lib/ollama/blobs/sha256-c264671c5e5347f66317f2d92db914e46dd242de0fe50f5a534991a0a15a2656 (version GGUF V3 (latest))
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv   0:                       general.architecture str              = command-r
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv   2:                               general.name str              = C4Ai Command R 08 2024
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv   3:                            general.version str              = 08-2024
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv   4:                           general.basename str              = c4ai-command-r
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv   5:                         general.size_label str              = 32B
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv   6:                            general.license str              = cc-by-nc-4.0
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv   7:                          general.languages arr[str,10]      = ["en", "fr", "de", "es", "it", "pt", ...
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv   8:                      command-r.block_count u32              = 40
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv   9:                   command-r.context_length u32              = 131072
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  10:                 command-r.embedding_length u32              = 8192
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  11:              command-r.feed_forward_length u32              = 24576
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  12:             command-r.attention.head_count u32              = 64
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  13:          command-r.attention.head_count_kv u32              = 8
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  14:                   command-r.rope.freq_base f32              = 4000000.000000
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  15:     command-r.attention.layer_norm_epsilon f32              = 0.000010
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  16:                          general.file_type u32              = 15
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  17:                      command-r.logit_scale f32              = 0.062500
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  18:                command-r.rope.scaling.type str              = none
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = command-r
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,256000]  = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
Sep 11 18:19:31 computer ollama[67030]: time=2024-09-11T18:19:31.691-06:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,253333]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 5
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 255001
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 0
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = true
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  28:               tokenizer.ggml.add_eos_token bool             = false
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  29:           tokenizer.chat_template.tool_use str              = {{ bos_token }}{% if messages[0]['rol...
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  30:                tokenizer.chat_template.rag str              = {{ bos_token }}{% if messages[0]['rol...
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  31:                   tokenizer.chat_templates arr[str,2]       = ["tool_use", "rag"]
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - kv  33:               general.quantization_version u32              = 2
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - type  f32:   41 tensors
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - type q4_K:  240 tensors
Sep 11 18:19:31 computer ollama[67030]: llama_model_loader: - type q6_K:   41 tensors
Sep 11 18:19:32 computer ollama[67030]: llm_load_vocab: special tokens cache size = 42
Sep 11 18:19:32 computer ollama[67030]: llm_load_vocab: token to piece cache size = 1.8428 MB
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: arch             = command-r
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: vocab type       = BPE
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: n_vocab          = 256000
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: n_merges         = 253333
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: vocab_only       = 0
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: n_ctx_train      = 131072
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: n_embd           = 8192
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: n_layer          = 40
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: n_head           = 64
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: n_head_kv        = 8
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: n_rot            = 128
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: n_swa            = 0
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: n_embd_head_k    = 128
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: n_embd_head_v    = 128
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: n_gqa            = 8
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: n_embd_k_gqa     = 1024
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: n_embd_v_gqa     = 1024
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: f_norm_eps       = 1.0e-05
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: f_logit_scale    = 6.2e-02
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: n_ff             = 24576
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: n_expert         = 0
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: n_expert_used    = 0
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: causal attn      = 1
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: pooling type     = 0
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: rope type        = 0
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: rope scaling     = none
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: freq_base_train  = 4000000.0
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: freq_scale_train = 1
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: rope_finetuned   = unknown
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: ssm_d_conv       = 0
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: ssm_d_inner      = 0
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: ssm_d_state      = 0
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: model type       = 35B
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: model ftype      = Q4_K - Medium
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: model params     = 32.30 B
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: model size       = 18.43 GiB (4.90 BPW)
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: general.name     = C4Ai Command R 08 2024
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: BOS token        = 5 '<BOS_TOKEN>'
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: EOS token        = 255001 '<|END_OF_TURN_TOKEN|>'
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: PAD token        = 0 '<PAD>'
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: LF token         = 136 'Ä'
Sep 11 18:19:32 computer ollama[67030]: llm_load_print_meta: max token length = 1024
Sep 11 18:19:35 computer ollama[67030]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 11 18:19:35 computer ollama[67030]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 11 18:19:35 computer ollama[67030]: ggml_cuda_init: found 1 ROCm devices:
Sep 11 18:19:35 computer ollama[67030]:   Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
Sep 11 18:19:35 computer ollama[67030]: llm_load_tensors: ggml ctx size =    0.31 MiB
Sep 11 18:19:36 computer ollama[67030]: time=2024-09-11T18:19:36.916-06:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 11 18:19:37 computer ollama[67030]: time=2024-09-11T18:19:37.232-06:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 11 18:19:37 computer ollama[67030]: llm_load_tensors: offloading 40 repeating layers to GPU
Sep 11 18:19:37 computer ollama[67030]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 11 18:19:37 computer ollama[67030]: llm_load_tensors: offloaded 41/41 layers to GPU
Sep 11 18:19:37 computer ollama[67030]: llm_load_tensors:      ROCm0 buffer size = 18873.16 MiB
Sep 11 18:19:37 computer ollama[67030]: llm_load_tensors:  ROCm_Host buffer size =  1640.62 MiB
Sep 11 18:19:37 computer ollama[67030]: time=2024-09-11T18:19:37.683-06:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
Sep 11 18:19:47 computer ollama[67030]: time=2024-09-11T18:19:47.855-06:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: segmentation fault (core dumped)"
Sep 11 18:19:47 computer ollama[67030]: [GIN] 2024/09/11 - 18:19:47 | 500 | 16.529325999s |       127.0.0.1 | POST     "/api/chat"
Author
Owner

@ross-rosario commented on GitHub (Sep 12, 2024):

If it helps, here are the local ROCm-related packages. Also, I have 32 GB of RAM, of which 20+ GB are free.

yay -Qs rocm
local/comgr 6.0.2-1
    Compiler support library for ROCm LLVM
local/hip-runtime-amd 6.0.2-4
    Heterogeneous Interface for Portability ROCm
local/hipblas 6.0.2-1
    ROCm BLAS marshalling library
local/hsa-rocr 6.0.2-2
    HSA Runtime API and runtime for ROCm
local/ollama-rocm 0.3.9-3
    Create, run and share large language models (LLMs) with ROCm
local/rocblas 6.0.2-1
    Next generation BLAS implementation for ROCm platform
local/rocm-core 6.0.2-2
    AMD ROCm core package (version files)
local/rocm-device-libs 6.0.2-1
    ROCm Device Libraries
local/rocm-llvm 6.0.2-1
    Radeon Open Compute - LLVM toolchain (llvm, clang, lld)
local/rocm-opencl-runtime 6.0.2-1
    OpenCL implementation for AMD
local/rocminfo 6.0.2-1
    ROCm Application for Reporting System Info
local/rocsolver 6.0.2-1
    Subset of LAPACK functionality on the ROCm platform
local/rocsparse 6.0.2-2
    BLAS for sparse computation on top of ROCm
Author
Owner

@dhiltgen commented on GitHub (Sep 12, 2024):

It might be a different crash, but my suspicion is it's memory-prediction related. To work around this until we find the root cause, you can set num_gpu to a smaller value (try 40, 39, ...) or use the new env var OLLAMA_GPU_OVERHEAD to reserve some VRAM so our algorithm calculates fewer layers to load. Let us know how many layers do load successfully, or if this turns out not to be the root cause and the crash isn't memory related.

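Purely as an illustration of those two knobs (not commands from this thread; the model name, the 2 GiB value, and num_gpu 40 are placeholders), assuming num_gpu is accepted in the request options like other Modelfile parameters and that OLLAMA_GPU_OVERHEAD is read from the server's environment:

# Cap the offloaded layers for a single request via the REST API
curl http://localhost:11434/api/generate -d '{
  "model": "command-r:35b-08-2024-q4_K_M",
  "prompt": "hello",
  "options": { "num_gpu": 40 }
}'

# Or reserve VRAM for the whole server (value is in bytes; 2 GiB = 2147483648)
OLLAMA_GPU_OVERHEAD=2147483648 ollama serve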
Author
Owner

@ross-rosario commented on GitHub (Sep 12, 2024):

@dhiltgen would you point me to the docs for those vars, please?

Edit: I'll try OLLAMA_GPU_OVERHEAD as soon as I get my hands on ollama v0.3.10.

Author
Owner

@dhiltgen commented on GitHub (Sep 13, 2024):

@remon-nashid you can run ollama serve --help to get a short description of the configuration variables, and some are discussed in more depth in various markdown files in the docs directory (https://github.com/ollama/ollama/tree/main/docs).

Author
Owner

@ProjectMoon commented on GitHub (Sep 13, 2024):

Found this issue because I was experiencing a similar thing. Gemma2 Q4_K_M would load just fine some time ago, but now crashes in a conversation after a few back-and-forths between me and the AI. Setting the GPU overhead to 1 GiB 'fixes' it: it loads 33 out of 47 layers.

Author
Owner

@ross-rosario commented on GitHub (Sep 14, 2024):

Well, I've tried setting OLLAMA_GPU_OVERHEAD=1073741824 (the value is in bytes, right?) but the segmentation fault persists. Perhaps I need to try different values.

By the way, that's on ollama 0.3.10.

Author
Owner

@ProjectMoon commented on GitHub (Sep 14, 2024):

The value is in bytes. Try increasing it until it doesn't crash. And if that doesn't work, manually lower the num_gpu parameter in the Modelfile until it loads.

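A minimal sketch of the Modelfile route mentioned above (the base model, the derived model name, and the value 38 are placeholders, not values from this thread):

cat > Modelfile <<'EOF'
FROM command-r:35b-08-2024-q4_K_M
PARAMETER num_gpu 38
EOF

# Build a derived model with the lower layer count, then run it
ollama create command-r-fewer-layers -f Modelfile
ollama run command-r-fewer-layers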
Author
Owner

@ross-rosario commented on GitHub (Sep 14, 2024):

Thanks @ProjectMoon for your help, but I wonder: assuming that, due to a bug, ollama doesn't calculate the available VRAM correctly, shouldn't it try to offload some layers to the CPU? It used to behave that way in older versions, but that too seems to have regressed lately.

Author
Owner

@ross-rosario commented on GitHub (Sep 15, 2024):

Just wanted to update you guys that I've tried an overhead of up to 10 GB, but it still wouldn't load the model.

Author
Owner

@ProjectMoon commented on GitHub (Sep 15, 2024):

It should offload to the CPU, yes, especially if you manually lower the number of GPU layers.

But if a 10 GB overhead didn't let the model load, it sounds like something else is going on: either the overhead parameter isn't being set correctly, it isn't working in the code, or it's possibly something else entirely.

Are you using the ollama-supplied ROCm or a system-installed one?

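One way to rule out the "not set correctly" case, assuming the usual systemd-managed service (as the journald-formatted logs in this thread suggest): ollama prints its whole config map at startup, so the value should be visible there after a restart. This is a sketch, not a procedure from the thread.

sudo systemctl edit ollama   # add under [Service]: Environment="OLLAMA_GPU_OVERHEAD=1073741824"
sudo systemctl restart ollama

# The startup "server config" line (routes.go) should now show the value
journalctl -u ollama | grep -o 'OLLAMA_GPU_OVERHEAD:[0-9]*' | tail -n 1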
Author
Owner

@kennethjyang commented on GitHub (Sep 17, 2024):

I'm experiencing this as well. I'm using:

  • Ollama 0.3.10
  • AMD Radeon Pro W7900 (48 GB of VRAM)
  • Fedora 40 with the latest hipblas and rocm installed via dnf
    • I get the same thing with Ubuntu 22.04 LTS

I can load Llama 3.1 8B but I can't do any of the 70B models. I'll try the suggestions here, but I think there needs to be some way for Ollama to detect and deal with this.

In the logs I just get a message that says the LLM didn't "respond" and the exit sequence starts.

Author
Owner

@kennethjyang commented on GitHub (Sep 17, 2024):

Update: this still did not work, although I can't tell if it did anything. I kept increasing the GPU overhead and restarting the server, but there was no difference in behavior.

Author
Owner

@ross-rosario commented on GitHub (Sep 17, 2024):

> Are you using the ollama supplied ROCm or a system installed one?

I'm not aware of the ollama-supplied ROCm binaries. I have Arch Linux's package installed: https://archlinux.org/packages/extra/x86_64/ollama-rocm/

Author
Owner

@kennethjyang commented on GitHub (Sep 18, 2024):

I'm using the Fedora-supplied ROCm packages. And it works, since the log shows it was compatible. I can also run Llama 3.1 8B, just not 70B.

Author
Owner

@ross-rosario commented on GitHub (Sep 18, 2024):

@dhiltgen One interesting finding revealed by sending requests to ollama in parallel:

  • Send a request to two larger models, e.g. mistral-small:22b and internlm2:20b. Ollama would load one model, generate, unload it, load the next one, and generate. Everything is fine.
  • Send a request to one larger model and two smaller models, e.g. gemma2:2b, qwen2:1.5b, and finally mistral-small:22b. Ollama loads and runs generation for the smaller models in parallel; by the time it gets to the larger model, a segmentation fault occurs.
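A rough reproduction sketch of that second scenario (illustration only; the prompt and the non-streaming flag are arbitrary, and the model names are just the ones mentioned above):

# Fire the three requests in parallel; per the report, the two small models
# load and generate first, then the larger model triggers the crash
for m in gemma2:2b qwen2:1.5b mistral-small:22b; do
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\": \"$m\", \"prompt\": \"hello\", \"stream\": false}" &
done
wait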
Author
Owner

@ross-rosario commented on GitHub (Sep 20, 2024):

Any updates regarding this issue? Is there any other debugging info we can provide?

Author
Owner

@ross-rosario commented on GitHub (Sep 21, 2024):

I bit the bullet and downgraded to v0.3.6, and things are working again:

ollama ps
NAME                        	ID          	SIZE 	PROCESSOR	UNTIL              
command-r:35b-08-2024-q4_K_M	376304b5a505	21 GB	100% GPU 	4 minutes from now
Author
Owner

@unclemusclez commented on GitHub (Sep 24, 2024):

Me too.

ollama start
2024/09/24 01:50:45 routes.go:1153: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2024-09-24T01:50:45.820Z level=INFO source=images.go:753 msg="total blobs: 5"
time=2024-09-24T01:50:45.820Z level=INFO source=images.go:760 msg="total unused blobs removed: 0"
time=2024-09-24T01:50:45.821Z level=INFO source=routes.go:1200 msg="Listening on 127.0.0.1:11434 (version 0.3.11)"
time=2024-09-24T01:50:45.821Z level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama4127840924/runners
time=2024-09-24T01:51:01.096Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cuda_v12 rocm_v60102 cpu cpu_avx cpu_avx2 cuda_v11]"
time=2024-09-24T01:51:01.096Z level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
time=2024-09-24T01:51:01.156Z level=WARN source=amd_linux.go:60 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-09-24T01:51:01.159Z level=INFO source=amd_linux.go:346 msg="amdgpu is supported" gpu=0 gpu_type=gfx906
time=2024-09-24T01:51:01.159Z level=INFO source=types.go:107 msg="inference compute" id=0 library=rocm variant="" compute=gfx906 driver=0.0 name=1002:66a3 total="32.0 GiB" available="32.0 GiB"
[GIN] 2024/09/24 - 01:51:35 | 200 |      49.903µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/09/24 - 01:51:35 | 200 |   38.215884ms |       127.0.0.1 | POST     "/api/show"
time=2024-09-24T01:51:35.308Z level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-ba1103315c449ad06c9f5fd94230bde5bcf977f794af70afb107d29153c3cd53 gpu=0 parallel=4 available=34332102656 required="28.6 GiB"
time=2024-09-24T01:51:35.308Z level=INFO source=server.go:103 msg="system memory" total="11.7 GiB" free="8.1 GiB" free_swap="4.0 GiB"
time=2024-09-24T01:51:35.312Z level=INFO source=memory.go:326 msg="offload to rocm" layers.requested=-1 layers.model=81 layers.offload=81 layers.split="" memory.available="[32.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="28.6 GiB" memory.required.partial="28.6 GiB" memory.required.kv="2.5 GiB" memory.required.allocations="[28.6 GiB]" memory.weights.total="25.9 GiB" memory.weights.repeating="25.1 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
time=2024-09-24T01:51:35.316Z level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama4127840924/runners/rocm_v60102/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-ba1103315c449ad06c9f5fd94230bde5bcf977f794af70afb107d29153c3cd53 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 81 --no-mmap --parallel 4 --port 38195"
time=2024-09-24T01:51:35.317Z level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-09-24T01:51:35.317Z level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
time=2024-09-24T01:51:35.317Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=10 commit="44ccccd" tid="134189328024384" timestamp=1727142695
INFO [main] system info | n_threads=4 n_threads_batch=4 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="134189328024384" timestamp=1727142695 total_threads=8
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="7" port="38195" tid="134189328024384" timestamp=1727142695
llama_model_loader: loaded meta data with 29 key-value pairs and 724 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-ba1103315c449ad06c9f5fd94230bde5bcf977f794af70afb107d29153c3cd53 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 70B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 70B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 80
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 10
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
time=2024-09-24T01:51:35.570Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  162 tensors
llama_model_loader: - type q2_K:  321 tensors
llama_model_loader: - type q3_K:  160 tensors
llama_model_loader: - type q5_K:   80 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q2_K - Medium
llm_load_print_meta: model params     = 70.55 B
llm_load_print_meta: model size       = 24.56 GiB (2.99 BPW)
llm_load_print_meta: general.name     = Meta Llama 3.1 70B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, compute capability 9.0, VMM: no
llm_load_tensors: ggml ctx size =    0.68 MiB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors:      ROCm0 buffer size = 24817.00 MiB
llm_load_tensors:  ROCm_Host buffer size =   328.78 MiB
time=2024-09-24T01:51:38.990Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
time=2024-09-24T01:51:39.241Z level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: signal: segmentation fault (core dumped)"
<!-- gh-comment-id:2369962865 -->
Author
Owner

@dhiltgen commented on GitHub (Sep 24, 2024):

@remon-nashid so it sounds like this is not actually a memory prediction error, but a crash due to some incompatibility between the driver, ROCm, and llama.cpp code. It sounds like you're on Arch Linux, so installing the latest AMD downstream driver isn't an option for you and you have to stick with the older driver bundled into the Linux kernel. (We've found the AMD downstream driver is more reliable in our testing, but it has limited distro support.) It looks like you're setting `HSA_OVERRIDE_GFX_VERSION=11.0.0`, but that IS the gfx version of your GPU, so I'm a little confused why you're doing that, as it shouldn't be necessary. That setting disables some of our safeguard checks that verify the ROCm library being used supports the detected GPU.

> I have arch linux's package installed

I can't speak to how these are built, so given it looks like we're dealing with a runtime incompatibility/mismatch between driver/rocm/ollama, can you try using our official build with our bundled ROCm? If that fixes it, then we can shift this issue over to the archlinux maintainers of the package. If it still fails, then we know it's something else.

<!-- gh-comment-id:2371748115 -->
Author
Owner

@unclemusclez commented on GitHub (Sep 24, 2024):

Does ollama update itself? I have to cp 0.3.6 over each time to get it to start without segfault. I have no idea why that would be. I'm copying it to /usr/share/ollama

<!-- gh-comment-id:2372040470 -->
Author
Owner

@RayNguyen842857 commented on GitHub (Sep 25, 2024):

I'm running a 7600 XT on NixOS and having the same issue on both 0.3.10 and 0.3.11. Downgrading to 0.3.9 temporarily fixed the issue for me...

<!-- gh-comment-id:2372872696 -->
Author
Owner

@ProjectMoon commented on GitHub (Sep 25, 2024):

> @dhiltgen One interesting finding revealed by sending requests to ollama in parallel:
>
> * Send a request to **two larger models**, eg `mistral-small:22b` and `internlm2:20b`. Ollama would load one model, generate, unload it, load the next one, generate. **Everything is fine**.
>
> * Send a request to **one larger model and two smaller models**, eg `gemma2:2b`, `qwen2:1.5b` and finally `mistral-small:22b`. Ollama loads and runs generation for the smaller models in parallel; by the time it gets to the larger model, a `segmentation fault` occurs.

I've noticed behavior like this too. It will sometimes seemingly get confused about how it should load or unload models, and try to load a model into the GPU while the GPU is full. Only happens in parallel processing situations. I've had it with two larger models (14 GB+ VRAM) on my main AMD GPU while a 3rd smaller model is running on my secondary ancient NVidia GPU.
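
A rough way to reproduce that ordering from the shell (assuming the default port and substituting whatever models you actually have pulled) is just to fire the requests in parallel and then watch the server log for the crash:

```bash
# Hypothetical repro sketch: two small models plus one large one, all at once.
# Model names are just the examples quoted above; swap in your own tags.
curl -s http://127.0.0.1:11434/api/generate \
  -d '{"model":"gemma2:2b","prompt":"hi","stream":false}' &
curl -s http://127.0.0.1:11434/api/generate \
  -d '{"model":"qwen2:1.5b","prompt":"hi","stream":false}' &
curl -s http://127.0.0.1:11434/api/generate \
  -d '{"model":"mistral-small:22b","prompt":"hi","stream":false}' &
wait   # if the big model segfaults, it shows up in `journalctl -u ollama`
```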

<!-- gh-comment-id:2373172660 -->
Author
Owner

@MaciejMogilany commented on GitHub (Sep 25, 2024):

On my AMD APU, a segmentation fault appears whenever `--no-mmap` is applied on Linux. Ollama adds it in memory-constrained situations, which fits the scenarios described above.

	if (runtime.GOOS == "windows" && gpus[0].Library == "cuda" && opts.UseMMap == nil) ||
		(runtime.GOOS == "linux" && systemFreeMemory < estimate.TotalSize && opts.UseMMap == nil) ||
		(gpus[0].Library == "cpu" && opts.UseMMap == nil) ||
		(opts.UseMMap != nil && !*opts.UseMMap) {
		params = append(params, "--no-mmap")
	}

On https://github.com/ollama/ollama/pull/6282 every model, even a small 360M one, gives a segfault on the APU because --no-mmap is used there all the time to avoid duplicating memory in the shared memory the APU uses. ~~I will bet that the recent llama.cpp update (https://github.com/ollama/ollama/pull/6618) changes something.~~

EDIT: compiled llama.cpp at git 8962422 (the revision current ollama uses) and it works with --no-mmap without any error.
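
For anyone who needs a workaround in the meantime, here is a minimal sketch of forcing mmap back on for a single request through the API options (assuming the standard `use_mmap` runner option is still honored by the ollama build in use):

```bash
# Hedged sketch: explicitly request mmap for one generation via the options map.
# Setting `PARAMETER use_mmap true` in a Modelfile should do the same thing persistently.
curl -s http://127.0.0.1:11434/api/generate -d '{
  "model": "mistral-small:22b",
  "prompt": "hello",
  "stream": false,
  "options": { "use_mmap": true }
}'
```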

<!-- gh-comment-id:2373466549 -->
Author
Owner

@ProjectMoon commented on GitHub (Sep 25, 2024):

That might explain why I had to turn mmap back on for a bunch of models.

<!-- gh-comment-id:2373675475 -->
Author
Owner

@MaciejMogilany commented on GitHub (Sep 25, 2024):

> --no-mmap

--no-mmap allows loading big models on ollama 0.3.9 without OOM.

There is another recent issue, from an Nvidia card, with the same no-mmap / CUDA out-of-memory pattern: https://github.com/ollama/ollama/issues/6864

<!-- gh-comment-id:2374873763 -->
Author
Owner

@unclemusclez commented on GitHub (Sep 25, 2024):

i'm running 70b without --no-mmap on 0.3.9

<!-- gh-comment-id:2375180707 -->
Author
Owner

@MaciejMogilany commented on GitHub (Sep 26, 2024):

> i'm running 70b without --no-mmap on 0.3.9

Because it fits in GPU VRAM:

time=2024-09-24T01:51:35.312Z level=INFO source=memory.go:326 msg="offload to rocm" layers.requested=-1 layers.model=81 layers.offload=81 layers.split="" memory.available="[32.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="28.6 GiB" memory.required.partial="28.6 GiB" memory.required.kv="2.5 GiB" memory.required.allocations="[28.6 GiB]" memory.weights.total="25.9 GiB" memory.weights.repeating="25.1 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"

Partial offload in memory-constrained situations can make Linux page memory indefinitely. And there are other situations: on the APU I am able to load mistral-large q4 into GTT memory with the --no-mmap flag, while without it the max is mistral-large q2_K on 80 GiB GTT / 96 GiB RAM. Without the flag the CPU buffer often grows to the size of the whole model, giving 2x the memory requirement (one ROCm buffer plus one CPU buffer). But this is a problem specific to APU shared memory (VRAM lives in RAM).

<!-- gh-comment-id:2376134828 -->
Author
Owner

@waltercool commented on GitHub (Sep 26, 2024):

Hi everyone,

The way I have been avoiding some crashes (Gentoo-related issue https://github.com/ollama/ollama/issues/6857) is using Ollama's bundled libraries (via LD_LIBRARY_PATH).

While not ideal, it works fine. https://github.com/ollama/ollama/releases/download/v0.3.12/ollama-linux-amd64-rocm.tgz

This is the way I was able to run my tests for PR 6282 https://github.com/ollama/ollama/pull/6282#issuecomment-2375832252
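
Roughly like this (a sketch only; the layout inside the tarballs has changed between releases, so check what actually gets extracted before pointing LD_LIBRARY_PATH at it):

```bash
# Hedged sketch: run the official binary against its own bundled ROCm libraries
# instead of the distro ones. Assumes the v0.3.12 tarballs unpack ./bin and ./lib/ollama.
mkdir -p ~/ollama-bundled && cd ~/ollama-bundled
curl -LO https://github.com/ollama/ollama/releases/download/v0.3.12/ollama-linux-amd64.tgz
curl -LO https://github.com/ollama/ollama/releases/download/v0.3.12/ollama-linux-amd64-rocm.tgz
tar -xzf ollama-linux-amd64.tgz
tar -xzf ollama-linux-amd64-rocm.tgz
LD_LIBRARY_PATH="$PWD/lib/ollama" ./bin/ollama serve
```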

<!-- gh-comment-id:2377408649 -->
Author
Owner

@olly1240 commented on GitHub (Sep 26, 2024):

For me, both the system libraries and the ones bundled in the zip file result in the same segfault behavior; it crashes even with OLLAMA_GPU_OVERHEAD set to ridiculous amounts (3 GB), on a dGPU with 12 GB of VRAM. Previous releases work fine for me. Arch with packaged rocm 6.0.2.
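
(For anyone else trying that knob: as far as I can tell the value is read in bytes, so "3 GB" ends up looking like the sketch below. As said, it doesn't help here.)

```bash
# Hedged sketch: reserve ~3 GiB of VRAM headroom per GPU (OLLAMA_GPU_OVERHEAD is in bytes),
# then restart the server so the scheduler picks it up.
export OLLAMA_GPU_OVERHEAD=$((3 * 1024 * 1024 * 1024))
ollama serve
```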

<!-- gh-comment-id:2378077476 -->
Author
Owner

@unclemusclez commented on GitHub (Sep 26, 2024):

i notice that the libraries definitely have something to do with it. If i compile from source it is ok, but i still have to restart ollama to get it to not use the packaged libraries from a previous install, i think

the ENV variables probably have something to do with this. also, the LDs are an extra 7GB so to not use them would be ideal.

<!-- gh-comment-id:2378095619 -->
Author
Owner

@ross-rosario commented on GitHub (Oct 4, 2024):

> I can't speak to how these are built, so given it looks like we're dealing with a runtime incompatibility/mismatch between driver/rocm/ollama, can you try using our official build with our bundled ROCm? If that fixes it, then we can shift this issue over to the archlinux maintainers of the package. If it still fails, then we know it's something else.

I installed the official build after uninstalling the arch package but unfortunately, the same error is still reproducible.

<!-- gh-comment-id:2392641728 -->
Author
Owner

@carsoncall commented on GitHub (Oct 4, 2024):

+1 for the segmentation fault. I could run the 8b model, but not the 70b.
OS: Fedora 40 (Sway)
GPU: AMD RX 6800
Model: Llama3.1:70b
Ollama version: 3.12

I first tried installing Ollama's packaged ROCm drivers. This didn't seem to make a difference.

I tried Ollama version 3.9, and that was broken as well. Downgrading to 3.6 fixed it.

curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.3.6 sh

<!-- gh-comment-id:2394793455 -->
Author
Owner

@tbsteinb commented on GitHub (Oct 6, 2024):

+1 here as well.

I'm using the Docker image. I can confirm that if I roll back to 3.6 everything seems to work fine.

Host OS: Gentoo
GPU: AMD RX 7900 XTX
Model: Mixtral:8x7b
Ollama version: ollama:rocm (3.12)

Using ollama:0.3.6-rocm seems to address the issue. Setting OLLAMA_GPU_OVERHEAD would occasionally work, but only if I set it to ludicrously high values (20 GB), and the end result was that the model would (unsurprisingly) run effectively entirely on the CPU.

<!-- gh-comment-id:2395603075 -->
Author
Owner

@ross-rosario commented on GitHub (Oct 6, 2024):

So obviously one of the changes in https://github.com/ollama/ollama/compare/v0.3.6...v0.3.8 introduced that bug, but it's hard for my untrained eyes to spot it.

It might well be narrowed down to v0.3.6...v0.3.7, but I didn't test v0.3.7.

<!-- gh-comment-id:2395623738 -->
Author
Owner

@unclemusclez commented on GitHub (Oct 7, 2024):

~~this could also be a rocm 6.2.X issue. i think the timeline matches up~~ apparently not

<!-- gh-comment-id:2396723603 -->
Author
Owner

@olly1240 commented on GitHub (Oct 7, 2024):

> this could also be a rocm 6.2.X issue. i think the timeline matches up

This happens also on 6.0.2 so it might not be that

<!-- gh-comment-id:2396731458 -->
Author
Owner

@unclemusclez commented on GitHub (Oct 7, 2024):

does anyone know if this exists on the most recent versions of llama.cpp?

<!-- gh-comment-id:2396739124 -->
Author
Owner

@olly1240 commented on GitHub (Oct 7, 2024):

> does anyone know if this exists on the most recent versions of llama.cpp?

I quickly ran koboldcpp-rocm with a model that gives me segfaults on ollama and it seems it's working

<!-- gh-comment-id:2396819343 -->
Author
Owner

@unclemusclez commented on GitHub (Oct 8, 2024):

i ran sudo apt install libclblast-dev and it is working. i recompiled from source. gfx906

<!-- gh-comment-id:2399605002 -->
Author
Owner

@dhiltgen commented on GitHub (Oct 8, 2024):

It sounds like some people seeing the regression are able to run v0.3.6. Does anyone have confirmation the regression was in v0.3.7? If so, it may be this commit, which brought back a compile flag that fixed problems for some people with multiple Radeon cards, but if that's causing regressions for a broader set of people, maybe we need to back it out: https://github.com/ollama/ollama/commit/0b03b9c32f483be2d7a4e902d13a909b546ae6bf
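
If anyone wants to check quickly, here is a sketch using the same install-script trick carsoncall mentioned above (it assumes the failing model is already pulled and that you're on a systemd install):

```bash
# Hedged sketch: install each release in turn and try the model that segfaults.
for v in 0.3.6 0.3.7; do
  curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=$v sh
  ollama --version
  ollama run llama3.1:70b-instruct-q2_K "hello" || echo "=> $v crashed"
done
```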

<!-- gh-comment-id:2400976227 -->
Author
Owner

@olly1240 commented on GitHub (Oct 9, 2024):

> It sounds like some people seeing the regression are able to run v0.3.6. Does anyone have confirmation the regression was in v0.3.7? If so, it may be this commit, which brought back a compile flag that fixed problems for some people with multiple radeon cards, but if that's causing regressions for a broader set of people, maybe we need to back that out. 0b03b9c

Can confirm. ollama 3.6 works, 3.7 segfaults

<!-- gh-comment-id:2401165592 -->
Author
Owner

@unclemusclez commented on GitHub (Oct 9, 2024):

> It sounds like some people seeing the regression are able to run v0.3.6. Does anyone have confirmation the regression was in v0.3.7? If so, it may be this commit, which brought back a compile flag that fixed problems for some people with multiple radeon cards, but if that's causing regressions for a broader set of people, maybe we need to back that out. 0b03b9c

I've noticed that the normal install of Ollama for AMD will install its own version of lib/rocm. When that is deleted, I was able to use up to Ollama 0.3.9 when compiled from source.

I managed to get Ollama 0.3.12 to run from source yesterday. It ran 70B, but after I restarted my computer I had to revert back to 0.3.9.

I mentioned similar issues above a couple of weeks ago where it seems like this is an environment variable issue. I don't understand how newer versions would be able to run otherwise. Perhaps during the build phase the environment is properly set, but upon rebuild, or environment restart, the variable gets lost?

I just want to confirm: I currently have 0.3.9 (compiled from source and with the 0.3.6 lib folder deleted) working live on Ubuntu 24.04 with ROCm 6.2.1.
0.3.6 works out of the box, and I BELIEVE I managed to get 0.3.12 to work, but I can't replicate this.

<!-- gh-comment-id:2402112315 -->
Author
Owner

@dhiltgen commented on GitHub (Oct 9, 2024):

There might be multiple issues lurking in here. The fact that OLLAMA_GPU_OVERHEAD doesn't seem to help implies it's not a simple over-allocation, but a more subtle crash in llama.cpp or ROCm, possibly specific to a combination of ROCm versions and driver versions.

I haven't managed to reproduce so far. What would be helpful to try to narrow this down - could folks who are seeing these crashes enumerate the following details so we can try to see if we can repro?

  • Linux vs. Windows - for linux, what distro and version
  • What GPU
  • How much system RAM
  • AMD Driver version
    • for Linux, was it the upstream kernel bundled amdgpu and if so, what kernel version
    • or the downstream AMD newer driver, and if so, what amdgpu driver version
  • ROCm version - bundled from Ollama, or "bring your own" and if BYO, what version
  • Which model causes a crash, and did you use any custom parameters or is a simple ollama run somemodel sufficient to trigger the crash
  • Any custom env vars that you're using

The volume of logs is intense, but it may also be helpful to run the server with AMD_LOG_LEVEL=3 and share the logs around the time of the crash in case there's any interesting warnings/errors from the ROCm logging.
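
For example, something along these lines, adjusted to however you normally launch the server:

```bash
# Hedged sketch: run the server by hand with verbose ROCm logging and keep the output.
sudo systemctl stop ollama                      # avoid two servers competing for the port
AMD_LOG_LEVEL=3 OLLAMA_DEBUG=1 ollama serve 2>&1 | tee /tmp/ollama-amd.log
# then, from another terminal, trigger the crash, e.g.:
#   ollama run command-r:35b-08-2024-q4_K_M "hello"
```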

Please make sure you're running the latest Ollama version so we're not chasing ghosts that were already fixed.

<!-- gh-comment-id:2402740954 -->
Author
Owner

@paulchevalier commented on GitHub (Oct 10, 2024):

Didn't comment earlier, but I believe I have been getting this issue as well

  • Archlinux (was a problem already 2 weeks ago, but still valid with up to date packages), running Kernel 6.11 with the amdgpu driver. Not an option to run the AMD driver on Arch.
  • GPU RX6600, 8GB VRAM ( I am using HSA_OVERRIDE_GFX_VERSION=10.3.0, I know the GPU is technically not supported)
  • CPU is an old i7 2600K, only 8GB system ram
  • Tried various ROCM version: the one from arch (6.0.2), the one from aur (6.2.1?), and the one packaged by ollama with no luck in either case.
  • I tried rebuilding from source using the ollama-rocm-git AUR package with no better luck.
  • All models I have tried (llama3.2, llama3.2:1b, llama3.1, phi3:mini ) are causing this error.
  • Environment variables in the systemd file: HSA_OVERRIDE_GFX_VERSION=10.3.0 OLLAMA_DEBUG=1 AMD_LOG_LEVEL=3 (see the drop-in sketch at the end of this comment)

Please find attached the log I get when trying to run llama3.2:1b: [ollama.log](https://github.com/user-attachments/files/17320537/ollama.log)

From what I understand in the log, the crash happens just after the model is done loading in the GPU.

All these models run fine on CPU (albeit slowly) if I remove the HSA_OVERRIDE_GFX_VERSION variable to disable the GPU.
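
For completeness, this is roughly how those variables are set on my side (a sketch of a systemd drop-in, assuming the stock ollama.service):

```bash
# Hedged sketch: set the same env vars via a drop-in rather than editing the unit directly.
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null <<'EOF'
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"
Environment="OLLAMA_DEBUG=1"
Environment="AMD_LOG_LEVEL=3"
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama
```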

<!-- gh-comment-id:2403731147 -->
Author
Owner

@weak-kajuma commented on GitHub (Oct 10, 2024):

I have the same issue.
My environment:

  • 2x MI25
  • Ubuntu Server 24.04.1 LTS
  • rocm version 6.2.2.60202-116

I tried some versions with docker in my environment.

| | 22b | 27b |
|---|---|---|
| 0.3.12-rocm | unavailable | unavailable |
| 0.3.11-rocm | unavailable | unavailable |
| 0.3.10-rocm | unavailable | unavailable |
| 0.3.9-rocm | unavailable | unavailable |
| 0.3.8-rocm | unavailable | unavailable |
| 0.3.7-rocm | unavailable | unavailable |
| 0.3.6-rocm | work on 1 gpu | work on 2 gpus |

Using models:

  • 22b: schroneko/calm3-22b-chat:q4_0
  • 27b: gemma2:27b

I wish it would run on the latest version.

<!-- gh-comment-id:2404836048 -->
Author
Owner

@unclemusclez commented on GitHub (Oct 11, 2024):

0.3.13 branch, Ubuntu 24.04, ROCm 6.2.1: failed

× ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Fri 2024-10-11 00:45:43 UTC; 1h 19min ago
   Duration: 1d 21h 22min 46.223s
    Process: 16029 ExecStart=/usr/local/bin/ollama0.3.13 serve (code=exited, status=0/SUCCESS)
    Process: 16030 ExecStartPost=/usr/local/bin/ollama0.3.13 run llama3.1:70b-instruct-q2_K < /dev/null (code=exited, status=1/FAILURE)
   Main PID: 16029 (code=exited, status=0/SUCCESS)
        CPU: 3.789s

Oct 11 00:45:42 kamala ollama0.3.13[16029]: llm_load_tensors:      ROCm0 buffer size = 24817.00 MiB
Oct 11 00:45:42 kamala ollama0.3.13[16029]: llm_load_tensors:  ROCm_Host buffer size =   328.78 MiB
Oct 11 00:45:42 kamala ollama0.3.13[16029]: time=2024-10-11T00:45:42.816Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
Oct 11 00:45:43 kamala ollama0.3.13[16029]: time=2024-10-11T00:45:43.066Z level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: signal: segmentation fault (core dumped)"
Oct 11 00:45:43 kamala ollama0.3.13[16029]: [GIN] 2024/10/11 - 00:45:43 | 500 |  3.028948339s |       127.0.0.1 | POST     "/api/generate"
Oct 11 00:45:43 kamala ollama0.3.13[16030]: [836B blob data]
Oct 11 00:45:43 kamala systemd[1]: ollama.service: Control process exited, code=exited, status=1/FAILURE
Oct 11 00:45:43 kamala systemd[1]: ollama.service: Failed with result 'exit-code'.
Oct 11 00:45:43 kamala systemd[1]: Failed to start ollama.service - Ollama Service.
Oct 11 00:45:43 kamala systemd[1]: ollama.service: Consumed 3.789s CPU time.
<!-- gh-comment-id:2406399029 -->
Author
Owner

@ross-rosario commented on GitHub (Oct 12, 2024):

ollama v0.3.12

  • Linux vs. Windows - for linux, what distro and version

Arch Linux

  • What GPU

RX 7900 XTX

  • How much system RAM

32 GB

  • AMD Driver version
    • for Linux, was it the upstream kernel bundled amdgpu and if so, what kernel version

Kernel bundled driver. Kernel v6.11.3.

  • ROCm version - bundled from Ollama, or "bring your own" and if BYO, what version

BYO, v6.0.2. But the issue is reproducible with the official ollama build with its own ROCm.

  • Which model causes a crash, and did you use any custom parameters or is a simple ollama run somemodel sufficient to trigger the crash

Just running this without custom parameters causes the crash: ollama run command-r:35b-08-2024-q4_K_M

  • Any custom env vars that you're using

Nope.

Logs below

Oct 12 07:34:54 computer ollama[2247]: time=2024-10-12T07:34:54.430-06:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/var/lib/ollama/blobs/sha256-c264671c5e5347f66317f2d92db914e46dd242de0fe50f5a534991a0a15a2656 gpu=0 parallel=1 available=24361586688 required="20.1 GiB"
Oct 12 07:34:54 computer ollama[2247]: time=2024-10-12T07:34:54.430-06:00 level=INFO source=server.go:103 msg="system memory" total="30.5 GiB" free="19.2 GiB" free_swap="0 B"
Oct 12 07:34:54 computer ollama[2247]: time=2024-10-12T07:34:54.436-06:00 level=INFO source=memory.go:326 msg="offload to rocm" layers.requested=-1 layers.model=41 layers.offload=41 layers.split="" memory.available="[22.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="20.1 GiB" memory.required.partial="20.1 GiB" memory.required.kv="320.0 MiB" memory.required.allocations="[20.1 GiB]" memory.weights.total="17.1 GiB" memory.weights.repeating="15.5 GiB" memory.weights.nonrepeating="1.6 GiB" memory.graph.full="516.0 MiB" memory.graph.partial="2.1 GiB"
Oct 12 07:34:54 computer ollama[2247]: time=2024-10-12T07:34:54.439-06:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama122407284/runners/rocm/ollama_llama_server --model /var/lib/ollama/blobs/sha256-c264671c5e5347f66317f2d92db914e46dd242de0fe50f5a534991a0a15a2656 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --no-mmap --parallel 1 --port 43025"
Oct 12 07:34:54 computer ollama[2247]: time=2024-10-12T07:34:54.439-06:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
Oct 12 07:34:54 computer ollama[2247]: time=2024-10-12T07:34:54.439-06:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
Oct 12 07:34:54 computer ollama[2247]: time=2024-10-12T07:34:54.440-06:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
Oct 12 07:34:57 computer ollama[8361]: INFO [main] build info | build=3670 commit="84935033d" tid="123905119087680" timestamp=1728740097
Oct 12 07:34:57 computer ollama[8361]: INFO [main] system info | n_threads=16 n_threads_batch=16 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="123905119087680" timestamp=1728740097 total_threads=32
Oct 12 07:34:57 computer ollama[8361]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="43025" tid="123905119087680" timestamp=1728740097
Oct 12 07:34:57 computer ollama[2247]: time=2024-10-12T07:34:57.207-06:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: loaded meta data with 34 key-value pairs and 322 tensors from /var/lib/ollama/blobs/sha256-c264671c5e5347f66317f2d92db914e46dd242de0fe50f5a534991a0a15a2656 (version GGUF V3 (latest))
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv   0:                       general.architecture str              = command-r
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv   1:                               general.type str              = model
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv   2:                               general.name str              = C4Ai Command R 08 2024
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv   3:                            general.version str              = 08-2024
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv   4:                           general.basename str              = c4ai-command-r
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv   5:                         general.size_label str              = 32B
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv   6:                            general.license str              = cc-by-nc-4.0
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv   7:                          general.languages arr[str,10]      = ["en", "fr", "de", "es", "it", "pt", ...
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv   8:                      command-r.block_count u32              = 40
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv   9:                   command-r.context_length u32              = 131072
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  10:                 command-r.embedding_length u32              = 8192
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  11:              command-r.feed_forward_length u32              = 24576
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  12:             command-r.attention.head_count u32              = 64
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  13:          command-r.attention.head_count_kv u32              = 8
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  14:                   command-r.rope.freq_base f32              = 4000000.000000
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  15:     command-r.attention.layer_norm_epsilon f32              = 0.000010
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  16:                          general.file_type u32              = 15
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  17:                      command-r.logit_scale f32              = 0.062500
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  18:                command-r.rope.scaling.type str              = none
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = command-r
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,256000]  = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,253333]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 5
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 255001
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 0
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = true
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  28:               tokenizer.ggml.add_eos_token bool             = false
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  29:           tokenizer.chat_template.tool_use str              = {{ bos_token }}{% if messages[0]['rol...
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  30:                tokenizer.chat_template.rag str              = {{ bos_token }}{% if messages[0]['rol...
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  31:                   tokenizer.chat_templates arr[str,2]       = ["tool_use", "rag"]
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - kv  33:               general.quantization_version u32              = 2
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - type  f32:   41 tensors
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - type q4_K:  240 tensors
Oct 12 07:34:57 computer ollama[2247]: llama_model_loader: - type q6_K:   41 tensors
Oct 12 07:34:57 computer ollama[2247]: llm_load_vocab: special tokens cache size = 42
Oct 12 07:34:58 computer ollama[2247]: llm_load_vocab: token to piece cache size = 1.8428 MB
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: format           = GGUF V3 (latest)
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: arch             = command-r
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: vocab type       = BPE
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: n_vocab          = 256000
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: n_merges         = 253333
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: vocab_only       = 0
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: n_ctx_train      = 131072
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: n_embd           = 8192
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: n_layer          = 40
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: n_head           = 64
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: n_head_kv        = 8
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: n_rot            = 128
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: n_swa            = 0
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: n_embd_head_k    = 128
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: n_embd_head_v    = 128
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: n_gqa            = 8
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: n_embd_k_gqa     = 1024
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: n_embd_v_gqa     = 1024
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: f_norm_eps       = 1.0e-05
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: f_logit_scale    = 6.2e-02
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: n_ff             = 24576
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: n_expert         = 0
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: n_expert_used    = 0
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: causal attn      = 1
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: pooling type     = 0
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: rope type        = 0
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: rope scaling     = none
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: freq_base_train  = 4000000.0
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: freq_scale_train = 1
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: rope_finetuned   = unknown
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: ssm_d_conv       = 0
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: ssm_d_inner      = 0
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: ssm_d_state      = 0
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: ssm_dt_rank      = 0
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: ssm_dt_b_c_rms   = 0
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: model type       = 35B
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: model ftype      = Q4_K - Medium
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: model params     = 32.30 B
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: model size       = 18.43 GiB (4.90 BPW)
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: general.name     = C4Ai Command R 08 2024
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: BOS token        = 5 '<BOS_TOKEN>'
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: EOS token        = 255001 '<|END_OF_TURN_TOKEN|>'
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: PAD token        = 0 '<PAD>'
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: LF token         = 136 'Ä'
Oct 12 07:34:58 computer ollama[2247]: llm_load_print_meta: max token length = 1024
Oct 12 07:35:03 computer ollama[2247]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Oct 12 07:35:03 computer ollama[2247]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Oct 12 07:35:03 computer ollama[2247]: ggml_cuda_init: found 1 ROCm devices:
Oct 12 07:35:03 computer ollama[2247]:   Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
Oct 12 07:35:03 computer ollama[2247]: llm_load_tensors: ggml ctx size =    0.31 MiB
Oct 12 07:35:03 computer ollama[2247]: llm_load_tensors: offloading 40 repeating layers to GPU
Oct 12 07:35:03 computer ollama[2247]: llm_load_tensors: offloading non-repeating layers to GPU
Oct 12 07:35:03 computer ollama[2247]: llm_load_tensors: offloaded 41/41 layers to GPU
Oct 12 07:35:03 computer ollama[2247]: llm_load_tensors:      ROCm0 buffer size = 18873.16 MiB
Oct 12 07:35:03 computer ollama[2247]: llm_load_tensors:  ROCm_Host buffer size =  1640.62 MiB
Oct 12 07:35:04 computer ollama[2247]: time=2024-10-12T07:35:04.138-06:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
Oct 12 07:35:12 computer ollama[2247]: time=2024-10-12T07:35:12.958-06:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: signal: segmentation fault (core dumped)"
Oct 12 07:35:12 computer ollama[2247]: [GIN] 2024/10/12 - 07:35:12 | 500 |  18.62111287s |       127.0.0.1 | POST     "/api/generate"
Oct 12 07:35:17 computer ollama[2247]: time=2024-10-12T07:35:17.959-06:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.000688676 model=/var/lib/ollama/blobs/sha256-c264671c5e5347f66317f2d92db914e46dd242de0fe50f5a534991a0a15a2656
Oct 12 07:35:18 computer ollama[2247]: time=2024-10-12T07:35:18.208-06:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.250212765 model=/var/lib/ollama/blobs/sha256-c264671c5e5347f66317f2d92db914e46dd242de0fe50f5a534991a0a15a2656
Oct 12 07:35:18 computer ollama[2247]: time=2024-10-12T07:35:18.459-06:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.500758542 model=/var/lib/ollama/blobs/sha256-c264671c5e5347f66317f2d92db914e46dd242de0fe50f5a534991a0a15a2656
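
In case it helps anyone narrow this down further: since the runner dies with "core dumped", a backtrace from the crashed `ollama_llama_server` process is usually more informative than the journal lines above. Assuming systemd-coredump is catching the dumps (the default on Arch), something like this should work:

```
# find the core dump left by the crashed runner
coredumpctl list ollama_llama_server

# open the newest matching dump in gdb and print a backtrace
coredumpctl gdb ollama_llama_server
# inside gdb:
#   bt
```

Checking what the GPU actually had free right before the load can also be done with `rocm-smi --showmeminfo vram`.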

@paulchevalier commented on GitHub (Oct 12, 2024):

Is this solved for everyone? I am still getting the same error.
I just tried with the git version (using the AUR package ollama-rocm-git).

In parallel, I upgraded to 32 GB of system RAM so that I could run models on the CPU while doing other things, and I am still getting the error.
I can still run many models on CPU.

ollama_2024_10_12.log
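
Since CPU inference still works, one stopgap while the GPU path is broken is forcing CPU-only inference by setting `num_gpu` to 0 (so no layers are offloaded). A minimal sketch against the local API, assuming the default port and whichever model is already pulled:

```
curl http://localhost:11434/api/generate -d '{
  "model": "gemma2:27b-instruct-q4_K_M",
  "prompt": "hello",
  "options": { "num_gpu": 0 }
}'
```

The same thing can be done interactively with `/set parameter num_gpu 0` inside `ollama run`.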


@unclemusclez commented on GitHub (Oct 12, 2024):

> Is this solved for everyone? I am still getting the same error. I just tried with the git version (using the AUR package ollama-rocm-git).
>
> In parallel, I upgraded to 32 GB of system RAM so that I could run models on the CPU while doing other things, and I am still getting the error. I can still run many models on CPU.
>
> ollama_2024_10_12.log

@paulchevalier this is solved in the main branch currently when you git pull.
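
For the AUR route specifically: a `-git` package only picks up new upstream commits when it is actually rebuilt, so after the fix lands on main you need to force a rebuild rather than a normal sync. A sketch, assuming `yay` is the AUR helper in use:

```
# rebuild VCS (-git) packages against the latest upstream commits
yay -Syu --devel

# or force a rebuild of just this package
yay -S --rebuild ollama-rocm-git
```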


@dhiltgen commented on GitHub (Oct 14, 2024):

It's possible there are a couple of distinct root causes in here, but the main issue should be resolved in 0.3.14 when we release it in the coming days. If anyone still hits failures on that release, let us know and we'll continue to investigate.
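
For the standard Linux install, picking up that release once it's out should just be a matter of re-running the install script and confirming the reported version:

```
curl -fsSL https://ollama.com/install.sh | sh
ollama -v
```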


@walmartbaggg commented on GitHub (Oct 18, 2024):

I am currently getting the same issue with the Instinct MI60. I can run deepseek v2 16b and many other models, but once it reaches 27b (tested with gemma 27b) it causes a segmentation fault.

I did not try older versions, but I will. I have also tried setting OLLAMA_GPU_OVERHEAD, with no difference.

(This is on the latest ollama version.)
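
For reference, if the service is managed by systemd the variable has to be set on the unit rather than in your shell, e.g. via a drop-in override (the value is interpreted as bytes; the 1 GiB reserved here is just an arbitrary example):

```
sudo systemctl edit ollama
# in the editor, add:
#   [Service]
#   Environment="OLLAMA_GPU_OVERHEAD=1073741824"
sudo systemctl restart ollama
```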


@walmartbaggg commented on GitHub (Oct 18, 2024):

> It's possible there are a couple of distinct root causes in here, but the main issue should be resolved in 0.3.14 when we release it in the coming days. If anyone still hits failures on that release, let us know and we'll continue to investigate.

Oops, didn't see this. Hopefully it does fix it.


@ross-rosario commented on GitHub (Oct 20, 2024):

Ollama 0.3.14-rc0 works great! Thanks @dhiltgen!
