[GH-ISSUE #4333] segmentation fault when running codellama:34b on A100 #2697

Closed
opened 2026-04-12 13:01:11 -05:00 by GiteaMirror · 6 comments
Owner

Originally created by @jmorganca on GitHub (May 11, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4333

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

CLI:

$ ollama run codellama:34b
Error: llama runner process has terminated: signal: segmentation fault

Logs:

May 11 02:47:28 gpu ollama[27286]: time=2024-05-11T02:47:28.033Z level=INFO source=memory.go:127 msg="offload to gpu" layers.real=-1 layers.estimate=49 memory.available="39.0 GiB" memory.required.full="19.1 GiB" memory.required.partial="19.1 GiB" memory.required.kv="384.0 MiB" memory.weights.total="18.0 GiB" memory.weights.repeating="17.8 GiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="348.0 MiB"
May 11 02:47:28 gpu ollama[27286]: time=2024-05-11T02:47:28.034Z level=INFO source=memory.go:127 msg="offload to gpu" layers.real=-1 layers.estimate=49 memory.available="39.0 GiB" memory.required.full="19.1 GiB" memory.required.partial="19.1 GiB" memory.required.kv="384.0 MiB" memory.weights.total="18.0 GiB" memory.weights.repeating="17.8 GiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="348.0 MiB"
May 11 02:47:28 gpu ollama[27286]: time=2024-05-11T02:47:28.035Z level=INFO source=server.go:308 msg="starting llama server" cmd="/tmp/ollama944909272/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-f36b668ebcd329357fac22db35f6414a1c9309307f33d08fe217bbf84b0496cc --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 49 --parallel 1 --port 36091"
May 11 02:47:28 gpu ollama[27286]: time=2024-05-11T02:47:28.035Z level=INFO source=sched.go:333 msg="loaded runners" count=1
May 11 02:47:28 gpu ollama[27286]: time=2024-05-11T02:47:28.035Z level=INFO source=server.go:478 msg="waiting for llama runner to start responding"
May 11 02:47:28 gpu ollama[27286]: time=2024-05-11T02:47:28.035Z level=INFO source=server.go:514 msg="waiting for server to become available" status="llm server error"
May 11 02:47:28 gpu ollama[28220]: INFO [main] build info | build=1 commit="952d03d" tid="140151386750976" timestamp=1715395648
May 11 02:47:28 gpu ollama[28220]: INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140151386750976" timestamp=1715395648 total_threads=12
May 11 02:47:28 gpu ollama[28220]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="36091" tid="140151386750976" timestamp=1715395648
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: loaded meta data with 20 key-value pairs and 435 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-f36b668ebcd329357fac22db35f6414a1c9309307f33d08fe217bbf84b0496cc (version GGUF V2)
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - kv   0:                       general.architecture str              = llama
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - kv   1:                               general.name str              = codellama
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - kv   3:                     llama.embedding_length u32              = 8192
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - kv   4:                          llama.block_count u32              = 48
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 22016
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 64
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - kv  11:                          general.file_type u32              = 2
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - kv  19:               general.quantization_version u32              = 2
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - type  f32:   97 tensors
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - type q4_0:  337 tensors
May 11 02:47:28 gpu ollama[27286]: llama_model_loader: - type q6_K:    1 tensors
May 11 02:47:28 gpu ollama[27286]: llm_load_vocab: special tokens definition check successful ( 259/32000 ).
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: format           = GGUF V2
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: arch             = llama
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: vocab type       = SPM
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: n_vocab          = 32000
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: n_merges         = 0
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: n_ctx_train      = 16384
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: n_embd           = 8192
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: n_head           = 64
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: n_head_kv        = 8
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: n_layer          = 48
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: n_rot            = 128
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: n_embd_head_k    = 128
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: n_embd_head_v    = 128
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: n_gqa            = 8
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: n_embd_k_gqa     = 1024
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: n_embd_v_gqa     = 1024
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: f_norm_eps       = 0.0e+00
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: f_logit_scale    = 0.0e+00
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: n_ff             = 22016
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: n_expert         = 0
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: n_expert_used    = 0
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: causal attn      = 1
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: pooling type     = 0
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: rope type        = 0
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: rope scaling     = linear
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: freq_base_train  = 1000000.0
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: freq_scale_train = 1
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: n_yarn_orig_ctx  = 16384
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: rope_finetuned   = unknown
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: ssm_d_conv       = 0
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: ssm_d_inner      = 0
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: ssm_d_state      = 0
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: ssm_dt_rank      = 0
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: model type       = 34B
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: model ftype      = Q4_0
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: model params     = 33.74 B
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: model size       = 17.74 GiB (4.52 BPW)
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: general.name     = codellama
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: BOS token        = 1 '<s>'
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: EOS token        = 2 '</s>'
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: UNK token        = 0 '<unk>'
May 11 02:47:28 gpu ollama[27286]: llm_load_print_meta: LF token         = 13 '<0x0A>'
May 11 02:47:28 gpu ollama[27286]: [52B blob data]
May 11 02:47:28 gpu ollama[27286]: time=2024-05-11T02:47:28.286Z level=ERROR source=sched.go:339 msg="error loading llama server" error="llama runner process has terminated: signal: segmentation fault "
May 11 02:47:28 gpu ollama[27286]: [GIN] 2024/05/11 - 02:47:28 | 500 |  1.242539308s |       127.0.0.1 | POST     "/api/chat"
May 11 02:47:30 gpu ollama[27286]: time=2024-05-11T02:47:30.881Z level=WARN source=sched.go:507 msg="gpu VRAM usage didn't recover within timeout" seconds=5.22935882
May 11 02:47:31 gpu ollama[27286]: time=2024-05-11T02:47:31.211Z level=WARN source=sched.go:507 msg="gpu VRAM usage didn't recover within timeout" seconds=5.559953212
May 11 02:47:31 gpu ollama[27286]: time=2024-05-11T02:47:31.542Z level=WARN source=sched.go:507 msg="gpu VRAM usage didn't recover within timeout" seconds=5.890430607
May 11 02:47:36 gpu ollama[27286]: time=2024-05-11T02:47:36.102Z level=WARN source=sched.go:507 msg="gpu VRAM usage didn't recover within timeout" seconds=5.221611086
May 11 02:47:36 gpu ollama[27286]: time=2024-05-11T02:47:36.434Z level=WARN source=sched.go:507 msg="gpu VRAM usage didn't recover within timeout" seconds=5.552699521
May 11 02:47:36 gpu ollama[27286]: time=2024-05-11T02:47:36.764Z level=WARN source=sched.go:507 msg="gpu VRAM usage didn't recover within timeout" seconds=5.883013969
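The "offload to gpu" line at the top of the log shows the scheduler's decision: with 39.0 GiB of VRAM free and only 19.1 GiB required for a full offload, all 49 layers (48 transformer blocks plus the output layer) are placed on the GPU, so the crash is not a simple out-of-memory. A minimal sketch of that comparison (a hypothetical simplification, not Ollama's actual memory.go logic):

```python
def estimate_offload_layers(available_gib: float, required_full_gib: float,
                            required_partial_gib: float, n_layers: int) -> int:
    """Return how many layers to offload to the GPU.

    Hypothetical simplification of the decision logged by memory.go:
    if the whole model fits, offload every layer; otherwise offload
    roughly in proportion to the VRAM that is free (floored at zero).
    """
    if available_gib >= required_full_gib:
        return n_layers  # full offload, as in the log (layers.estimate=49)
    if available_gib <= 0:
        return 0
    # Partial offload: scale the layer count by the fraction of memory available.
    frac = available_gib / required_partial_gib
    return max(0, min(n_layers, int(n_layers * frac)))

# Values from the log line above: 39.0 GiB free, 19.1 GiB needed, 49 layers.
print(estimate_offload_layers(39.0, 19.1, 19.1, 49))  # → 49
```

Since the estimate says the model fits entirely, the segfault points at the runner itself rather than the scheduler's placement.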

OS

Linux

GPU

NVIDIA A100 40GB

CPU

Intel

Ollama version

0.1.35

GiteaMirror added the bug, gpu, nvidia labels 2026-04-12 13:01:11 -05:00

@akulbe commented on GitHub (May 16, 2024):

This is also happening in my setup.

OS: Linux Fedora 39
GPU: 2 x RTX 3090s
CPU: AMD
Ollama version: 0.1.38


@Yandrik commented on GitHub (May 23, 2024):

Same here

OS: Linux (Fedora 39)
GPU: 2 x RTX 4090s
CPU: AMD
Ollama version: 0.1.38


@AlexanderZhk commented on GitHub (May 26, 2024):

Same here. Phind-codellama in fp16 from the repository works, but it doesn't when loaded from a GGUF.
(Ubuntu, Docker, A100, Ollama 0.1.37)


@rockoo commented on GitHub (May 31, 2024):

Same issue:

OS: Ubuntu 22.04
GPU: RTX 3090
CPU: AMD Ryzen


@mhgrove commented on GitHub (Jun 7, 2024):

I am seeing this as well on Ubuntu 23.10. It started happening after I updated Ollama from 0.1.33 to 0.1.41. Every codellama:34b model I've tried (34b, 34b-code, 34b-instruct, 34b-python, 34b-q8_0) core dumps when I attempt to use it. 13b works, 70b works; none of the 34b variants work.

GPU: RTX Ada6000


@dhiltgen commented on GitHub (Jul 22, 2024):

I believe this has been fixed in recent releases.

@mhgrove I haven't been able to repro on v0.2.7 on an A6000 Ada. The 34b models load correctly now.

If anyone is still seeing crashes with codellama:34b models, please make sure to upgrade to the latest version, and if that doesn't resolve it, please share your server log, and more information about your setup (custom options like context size, etc.) and I'll reopen.
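For anyone still hitting this, the information requested above (version plus server log) can be gathered with a short script. This is a hedged sketch assuming a systemd-based Linux install; `ollama --version` and `journalctl -u ollama` are the documented ways to check the version and read the server log on Linux, but adjust for your own setup:

```python
import shutil
import subprocess

def collect_report() -> dict:
    """Gather the Ollama version and recent server log lines, if available."""
    report = {}
    # Client/server version string, e.g. "ollama version is 0.1.35".
    if shutil.which("ollama"):
        report["version"] = subprocess.run(
            ["ollama", "--version"], capture_output=True, text=True
        ).stdout.strip()
    # Last 200 server log lines on a systemd-based install.
    if shutil.which("journalctl"):
        report["log"] = subprocess.run(
            ["journalctl", "-u", "ollama", "--no-pager", "-n", "200"],
            capture_output=True, text=True,
        ).stdout
    return report

if __name__ == "__main__":
    for key, value in collect_report().items():
        print(f"--- {key} ---\n{value}")
```

Attaching the resulting version string and log excerpt to a comment gives the maintainers enough context to decide whether the issue should be reopened.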


Reference: github-starred/ollama#2697