[GH-ISSUE #2564] Ollama crashes on CUBLAS_STATUS_NOT_SUPPORTED while loading Falcon model #48017

Closed
opened 2026-04-28 06:26:41 -05:00 by GiteaMirror · 5 comments

Originally created by @keesj-riscure on GitHub (Feb 17, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2564

I just upgraded to the latest Ollama to verify the issue, and it is still present on my hardware.

I am running version 0.1.25 and trying to run the falcon model

Warning: could not connect to a running Ollama instance
Warning: client version is 0.1.25

```
time=2024-02-17T17:04:21.062+01:00 level=INFO source=images.go:706 msg="total blobs: 37"
time=2024-02-17T17:04:21.063+01:00 level=INFO source=images.go:713 msg="total unused blobs removed: 0"
time=2024-02-17T17:04:21.064+01:00 level=INFO source=routes.go:1014 msg="Listening on 127.0.0.1:11434 (version 0.1.25)"
time=2024-02-17T17:04:21.064+01:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-02-17T17:04:24.780+01:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx2 cpu_avx cpu cuda_v11 rocm_v5 rocm_v6]"
time=2024-02-17T17:04:24.781+01:00 level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-02-17T17:04:24.781+01:00 level=INFO source=gpu.go:262 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-02-17T17:04:24.782+01:00 level=INFO source=gpu.go:308 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.515.65.01]"
time=2024-02-17T17:04:24.784+01:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected"
time=2024-02-17T17:04:24.784+01:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-17T17:04:24.795+01:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6"
[GIN] 2024/02/17 - 17:04:42 | 200 |      55.124µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/02/17 - 17:04:42 | 200 |     838.883µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/02/17 - 17:04:42 | 200 |     338.685µs |       127.0.0.1 | POST     "/api/show"
time=2024-02-17T17:04:43.129+01:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-17T17:04:43.129+01:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6"
time=2024-02-17T17:04:43.129+01:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-17T17:04:43.129+01:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6"
time=2024-02-17T17:04:43.129+01:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
loading library /tmp/ollama1967682888/cuda_v11/libext_server.so
time=2024-02-17T17:04:43.140+01:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama1967682888/cuda_v11/libext_server.so"
time=2024-02-17T17:04:43.140+01:00 level=INFO source=dyn_ext_server.go:145 msg="Initializing llama server"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 18 key-value pairs and 196 tensors from /home/keesj/.ollama/models/blobs/sha256:305c4103a989d3f8ac457f912af30f32693f20dcffe1495e18c2ed7b5596b2d1 (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = falcon
llama_model_loader: - kv   1:                               general.name str              = Falcon
llama_model_loader: - kv   2:                      falcon.context_length u32              = 2048
llama_model_loader: - kv   3:                  falcon.tensor_data_layout str              = jploski
llama_model_loader: - kv   4:                    falcon.embedding_length u32              = 4544
llama_model_loader: - kv   5:                 falcon.feed_forward_length u32              = 18176
llama_model_loader: - kv   6:                         falcon.block_count u32              = 32
llama_model_loader: - kv   7:                falcon.attention.head_count u32              = 71
llama_model_loader: - kv   8:             falcon.attention.head_count_kv u32              = 1
llama_model_loader: - kv   9:        falcon.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,65024]   = [">>TITLE<<", ">>ABSTRACT<<", ">>INTR...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,65024]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,65024]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,64784]   = ["Ġ t", "Ġ a", "i n", "h e", "r e",...
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 11
llama_model_loader: - kv  17:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q4_0:  129 tensors
llama_model_loader: - type q8_0:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 12/65024 vs 0/65024 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = falcon
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 65024
llm_load_print_meta: n_merges         = 64784
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 4544
llm_load_print_meta: n_head           = 71
llm_load_print_meta: n_head_kv        = 1
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 71
llm_load_print_meta: n_embd_k_gqa     = 64
llm_load_print_meta: n_embd_v_gqa     = 64
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 18176
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 7.22 B
llm_load_print_meta: model size       = 3.92 GiB (4.66 BPW)
llm_load_print_meta: general.name     = Falcon
llm_load_print_meta: BOS token        = 11 '<|endoftext|>'
llm_load_print_meta: EOS token        = 11 '<|endoftext|>'
llm_load_print_meta: LF token         = 138 'Ä'
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   158.50 MiB
llm_load_tensors:      CUDA0 buffer size =  1888.89 MiB
llm_load_tensors:      CUDA1 buffer size =  1966.10 MiB
....................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =     8.50 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =     7.50 MiB
llama_new_context_with_model: KV self size  =   16.00 MiB, K (f16):    8.00 MiB, V (f16):    8.00 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =    13.89 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   332.63 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   332.63 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     8.88 MiB
llama_new_context_with_model: graph splits (measure): 5
time=2024-02-17T17:04:45.088+01:00 level=INFO source=dyn_ext_server.go:156 msg="Starting llama main loop"
[GIN] 2024/02/17 - 17:04:45 | 200 |  2.677575437s |       127.0.0.1 | POST     "/api/chat"
CUDA error: CUBLAS_STATUS_NOT_SUPPORTED
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/ggml-cuda.cu:9889
  cublasGemmBatchedEx(g_cublas_handles[g_main_device], CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), CUDA_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), CUDA_R_16F, nb11/nb10, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/ggml-cuda.cu:241: !"CUDA error"
```
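
For anyone who wants to poke at this outside of ollama: below is a minimal standalone CUDA/C++ probe (a sketch, not code from ollama or llama.cpp) that issues a single FP16 `cublasGemmBatchedEx` call in the same shape family as the failing one, tensor-op algorithm included, and prints the returned status. The dimensions (head size 64, a batch of 71 heads) and the FP16 compute type are assumptions read off the trace and model metadata above, not values lifted from the real code:

```
// gemm_probe.cu — standalone status probe (illustrative sketch, not
// ollama/llama.cpp code). Build (assumed toolchain):
//   nvcc -o gemm_probe gemm_probe.cu -lcublas
#include <cstdio>
#include <vector>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    // Illustrative shape read off the metadata above: head dim 64,
    // 71 attention heads treated as the batch dimension.
    const int m = 64, n = 64, k = 64, batch = 71;

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublasCreate failed\n");
        return 1;
    }

    // One contiguous slab per operand; contents are irrelevant for a
    // status probe, so the memory is left uninitialized.
    half *A, *B, *C;
    cudaMalloc((void **)&A, sizeof(half) * (size_t)m * k * batch);
    cudaMalloc((void **)&B, sizeof(half) * (size_t)k * n * batch);
    cudaMalloc((void **)&C, sizeof(half) * (size_t)m * n * batch);

    // Per-batch pointer arrays, mirroring the ptrs_src/ptrs_dst arrays
    // visible in the GGML_ASSERT trace above.
    std::vector<const void *> hA(batch), hB(batch);
    std::vector<void *> hC(batch);
    for (int i = 0; i < batch; ++i) {
        hA[i] = A + (size_t)i * m * k;
        hB[i] = B + (size_t)i * k * n;
        hC[i] = C + (size_t)i * m * n;
    }
    const void **dA; const void **dB; void **dC;
    cudaMalloc((void **)&dA, sizeof(void *) * batch);
    cudaMalloc((void **)&dB, sizeof(void *) * batch);
    cudaMalloc((void **)&dC, sizeof(void *) * batch);
    cudaMemcpy(dA, hA.data(), sizeof(void *) * batch, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), sizeof(void *) * batch, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), sizeof(void *) * batch, cudaMemcpyHostToDevice);

    // Same call shape as the trace: FP16 A/B/C, CUBLAS_OP_T on A,
    // tensor-op algorithm selection.
    const half alpha = __float2half(1.0f), beta = __float2half(0.0f);
    cublasStatus_t st = cublasGemmBatchedEx(
        handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
        &alpha, dA, CUDA_R_16F, k,
                dB, CUDA_R_16F, k,
        &beta,  dC, CUDA_R_16F, m, batch,
        CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
    printf("cublasGemmBatchedEx: %d (0 == CUBLAS_STATUS_SUCCESS)\n", (int)st);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cudaFree(A); cudaFree(B); cudaFree(C);
    cublasDestroy(handle);
    return st == CUBLAS_STATUS_SUCCESS ? 0 : 1;
}
```

If this prints a non-zero status on the same driver/GPU combination, the failure reproduces independently of ollama's GGML_ASSERT handling.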

@VincentJGeisler commented on GitHub (Mar 29, 2024):

yup, got the same problem.

```
CUDA error: CUBLAS_STATUS_NOT_SUPPORTED
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas at C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu:10604
  cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), CUDA_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), CUDA_R_16F, nb11/nb10, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
GGML_ASSERT: C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu:193: !"CUDA error"
```

It seems to be a problem with the falcon:7b model specifically; 40b and 180b seem to work.

falcon:latest also works, but any of the 7b models fails, even when pulling directly from Hugging Face.
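
Worth noting for the 7b-vs-40b observation: the metadata dump in the original report shows falcon 7B uses a single KV head shared by 71 query heads (falcon.attention.head_count_kv = 1), so its attention matmuls have an unusual batched shape; one plausible, unconfirmed reading is that this is what trips the tensor-op cuBLAS path only on 7b. A minimal mitigation sketch along those lines, where gemm_batched_f16_with_fallback is a hypothetical helper rather than ollama's actual fix, would retry with the generic algorithm whenever the tensor-op one reports CUBLAS_STATUS_NOT_SUPPORTED:

```
// Hypothetical mitigation sketch (not ollama's actual patch): fall back
// to the generic cuBLAS algorithm when the tensor-op algorithm rejects
// the shape with CUBLAS_STATUS_NOT_SUPPORTED.
#include <cuda_fp16.h>
#include <cublas_v2.h>

cublasStatus_t gemm_batched_f16_with_fallback(
        cublasHandle_t handle, int m, int n, int k,
        const half *alpha, const void *const *Aarray, int lda,
        const void *const *Barray, int ldb,
        const half *beta, void *const *Carray, int ldc, int batch) {
    // First try the fast path seen in the failing traces above.
    cublasStatus_t st = cublasGemmBatchedEx(
        handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
        alpha, Aarray, CUDA_R_16F, lda, Barray, CUDA_R_16F, ldb,
        beta,  Carray, CUDA_R_16F, ldc, batch,
        CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
    if (st == CUBLAS_STATUS_NOT_SUPPORTED) {
        // Retry without requesting tensor-op algorithm selection.
        st = cublasGemmBatchedEx(
            handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
            alpha, Aarray, CUDA_R_16F, lda, Barray, CUDA_R_16F, ldb,
            beta,  Carray, CUDA_R_16F, ldc, batch,
            CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);
    }
    return st;
}
```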


@aminalaghband commented on GitHub (Apr 30, 2024):

```
CUDA error: CUBLAS_STATUS_NOT_INITIALIZED
  current device: 0, in function cublas_handle at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda/common.cuh:526
  cublasCreate_v2(&cublas_handles[device])
GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:60: !"CUDA error"
time=2024-04-30T21:57:12.409Z level=ERROR source=routes.go:120 msg="error loading llama server" error="llama runner process no longer running: -1 CUDA error: CUBLAS_STATUS_NOT_INITIALIZED\n  current device: 0, in function cublas_handle at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda/common.cuh:526\n  cublasCreate_v2(&cublas_handles[device])\nGGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:60: !\"CUDA error\""
```
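
Note this is a different failure from the one in the original report: here cublasCreate_v2 fails before any GEMM runs, which usually points at a driver/runtime mismatch or the device being out of memory rather than the unsupported batched-GEMM shape. A tiny standalone check (a sketch, not part of ollama) that isolates handle creation:

```
// cublas_init.cu — standalone check of cuBLAS handle creation (a sketch,
// not ollama code). CUBLAS_STATUS_NOT_INITIALIZED from cublasCreate is
// typically an environment problem (driver, runtime, or GPU memory).
// Build (assumed toolchain): nvcc -o cublas_init cublas_init.cu -lcublas
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);  // how much VRAM is actually free?
    printf("VRAM free/total: %zu / %zu bytes\n", free_b, total_b);

    cublasHandle_t h;
    cublasStatus_t st = cublasCreate(&h);
    printf("cublasCreate: %d (0 == CUBLAS_STATUS_SUCCESS)\n", (int)st);
    if (st == CUBLAS_STATUS_SUCCESS) cublasDestroy(h);
    return st == CUBLAS_STATUS_SUCCESS ? 0 : 1;
}
```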


@towhidultonmoy commented on GitHub (Jun 11, 2024):

Getting the same error with Falcon 7b models using the Ollama GPU version: error: an unknown error was encountered while running the model. CUDA error: CUBLAS_STATUS_NOT_SUPPORTED


@drdsgvo commented on GitHub (Jul 8, 2024):

Same error with ollama 0.1.47 and contexts longer than a few thousand characters or so. With short contexts the problem did not appear, or at least sometimes there was no crash.


@dhiltgen commented on GitHub (Jul 24, 2024):

The falcon model architecture is no longer supported, but [falcon2](https://ollama.com/library/falcon2) will work. We've updated the [model page](https://ollama.com/library/falcon) to indicate it is no longer supported on the latest versions of Ollama.
