[GH-ISSUE #1566] Error: llama runner exited, you may not have enough available memory to run this model #47369

Closed
opened 2026-04-28 03:38:14 -05:00 by GiteaMirror · 7 comments

Originally created by @baardove on GitHub (Dec 16, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1566

Hi,

When I run a model and try to communicate with it, I always get the same response, no matter which model (small or big):
```
Error: llama runner exited, you may not have enough available memory to run this model
```

Any clues on this one?

My host is running Ubuntu 20.04 on Proxmox with approx. 56 GB of memory free and an NVIDIA M40 24 GB GPU.
```
free
              total        used        free      shared  buff/cache   available
Mem:       58212660      641572    54462900        5692     3108188    56950236
Swap:       8388604           0     8388604
```

```
nvidia-smi
Sat Dec 16 19:39:44 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla M40 24GB Off | 00000000:01:00.0 Off | 0 |
| N/A 37C P8 16W / 250W | 0MiB / 23040MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
```

It seems Ollama finds the GPU.
journalctl:
```
Dec 16 18:30:05 tesla ollama[2245]: 2023/12/16 18:30:05 llama.go:300: 22939 MB VRAM available, loading up to 150 GPU layers
Dec 16 18:30:05 tesla ollama[2245]: 2023/12/16 18:30:05 llama.go:436: starting llama runner
Dec 16 18:30:05 tesla ollama[2245]: 2023/12/16 18:30:05 llama.go:494: waiting for llama runner to start responding
Dec 16 18:30:05 tesla ollama[2245]: ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
Dec 16 18:30:05 tesla ollama[2245]: ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
Dec 16 18:30:05 tesla ollama[2245]: ggml_init_cublas: found 1 CUDA devices:
Dec 16 18:30:05 tesla ollama[2245]: Device 0: Tesla M40 24GB, compute capability 5.2
Dec 16 18:30:05 tesla ollama[2326]: {"timestamp":1702751405,"level":"INFO","function":"main","line":2652,"message":"build info","build":441,"commit":"948ff1>
Dec 16 18:30:05 tesla ollama[2326]: {"timestamp":1702751405,"level":"INFO","function":"main","line":2655,"message":"system info","n_threads":8,"n_threads_ba>
Dec 16 18:30:05 tesla ollama[2245]: llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /usr/share/ollama/.ollama/models/blobs>


Dec 16 18:31:49 tesla ollama[2245]: llm_load_tensors: ggml ctx size = 0.12 MiB
Dec 16 18:31:49 tesla ollama[2245]: llm_load_tensors: using CUDA for GPU acceleration
Dec 16 18:31:49 tesla ollama[2245]: llm_load_tensors: mem required = 70.43 MiB
Dec 16 18:31:49 tesla ollama[2245]: llm_load_tensors: offloading 32 repeating layers to GPU
Dec 16 18:31:49 tesla ollama[2245]: llm_load_tensors: offloading non-repeating layers to GPU
Dec 16 18:31:49 tesla ollama[2245]: llm_load_tensors: offloaded 33/33 layers to GPU
Dec 16 18:31:49 tesla ollama[2245]: llm_load_tensors: VRAM used: 3577.56 MiB
```

Loading a model works fine, but the error comes when trying to communicate. It happens with any model, even the smallest:
Error: llama runner exited, you may not have enough available memory to run this model

journalctl:
```
Dec 16 18:31:50 tesla ollama[2245]: ..................................................................................................
Dec 16 18:31:50 tesla ollama[2245]: llama_new_context_with_model: n_ctx = 4096
Dec 16 18:31:50 tesla ollama[2245]: llama_new_context_with_model: freq_base = 10000.0
Dec 16 18:31:50 tesla ollama[2245]: llama_new_context_with_model: freq_scale = 1
Dec 16 18:31:51 tesla ollama[2245]: llama_kv_cache_init: VRAM kv self = 2048.00 MB
Dec 16 18:31:51 tesla ollama[2245]: llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
Dec 16 18:31:51 tesla ollama[2245]: llama_build_graph: non-view tensors processed: 676/676
Dec 16 18:31:51 tesla ollama[2245]: llama_new_context_with_model: compute buffer total size = 291.32 MiB
Dec 16 18:31:51 tesla ollama[2245]: llama_new_context_with_model: VRAM scratch buffer: 288.00 MiB
Dec 16 18:31:51 tesla ollama[2245]: llama_new_context_with_model: total VRAM used: 5913.57 MiB (model: 3577.56 MiB, context: 2336.00 MiB)
Dec 16 18:31:51 tesla ollama[2588]: {"timestamp":1702751511,"level":"INFO","function":"main","line":3035,"message":"HTTP server listening","hostname":"127.0>
Dec 16 18:31:51 tesla ollama[2588]: {"timestamp":1702751511,"level":"INFO","function":"log_server_request","line":2596,"message":"request","remote_addr":"12>
Dec 16 18:31:51 tesla ollama[2245]: 2023/12/16 18:31:51 llama.go:508: llama runner started in 2.201689 seconds
Dec 16 18:31:51 tesla ollama[2245]: [GIN] 2023/12/16 - 18:31:51 | 200 | 2.311479662s | 127.0.0.1 | POST "/api/generate"
Dec 16 18:32:14 tesla ollama[2588]: {"timestamp":1702751534,"level":"INFO","function":"log_server_request","line":2596,"message":"request","remote_addr":"12>
Dec 16 18:32:14 tesla ollama[2245]: 2023/12/16 18:32:14 llama.go:577: loaded 0 images
Dec 16 18:32:14 tesla ollama[2245]: cuBLAS error 15 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:8448
Dec 16 18:32:14 tesla ollama[2245]: current device: 0
Dec 16 18:32:14 tesla ollama[2245]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:8448: !"cuBLAS error"
Dec 16 18:32:14 tesla ollama[2245]: 2023/12/16 18:32:14 llama.go:451: signal: aborted (core dumped)
Dec 16 18:32:14 tesla ollama[2245]: 2023/12/16 18:32:14 llama.go:525: llama runner stopped successfully
Dec 16 18:32:14 tesla ollama[2245]: [GIN] 2023/12/16 - 18:32:14 | 200 | 601.813679ms | 127.0.0.1 | POST "/api/generate"
```

Full log:
https://www.evernote.com/shard/s16/sh/6d2eab19-c11f-7cf4-148c-9a5cd04dc944/Zwy3R7zsW8TvzDquK5Devnpko4BPwqNquvDt4nHLGCiecB_luwmk3sH8ug

The GPU is a bit dated, so it might be missing some features newer NVIDIA cards have. It is an affordable option with a lot of VRAM, so it would be nice if it were supported.

When running ComfyUI I have to start it with `--disable-cuda-malloc`.
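For reference, a minimal sketch of how that looks (assuming a standard ComfyUI checkout; `--disable-cuda-malloc` is ComfyUI's own flag for skipping its async CUDA allocator, which some older GPUs don't handle):

```bash
# Sketch, assuming a standard ComfyUI checkout; adjust the path as needed.
cd ComfyUI
python main.py --disable-cuda-malloc
```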

Regards,

Bård Ove Myhr

GiteaMirror added the bug label 2026-04-28 03:38:14 -05:00

@baardove commented on GitHub (Dec 16, 2023):

I kind of got it working by setting num_gpu to 40, as mentioned in another post. But it still produces an error, and I suspect it reverts to using the CPU (it pulled a lot of memory).

No error in chat window though...
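For anyone trying the same workaround, a minimal sketch (num_gpu is a standard Ollama option; the model name is my assumption based on the Mixtral-like metadata in the log below):

```bash
# Option A: bake the layer cap into a Modelfile. PARAMETER num_gpu is a
# documented Ollama option; "mixtral" is assumed from the 8-expert, 46.7B
# metadata in the log below.
cat > Modelfile <<'EOF'
FROM mixtral
PARAMETER num_gpu 40
EOF
ollama create mixtral-gpu40 -f Modelfile

# Option B: set it per request through the REST API.
curl http://localhost:11434/api/generate -d '{
  "model": "mixtral",
  "prompt": "Hello",
  "options": {"num_gpu": 40}
}'
```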

```
Dec 16 21:04:43 tesla ollama[8296]: llm_load_print_meta: n_expert = 8
Dec 16 21:04:43 tesla ollama[8296]: llm_load_print_meta: n_expert_used = 2
Dec 16 21:04:43 tesla ollama[8296]: llm_load_print_meta: rope scaling = linear
Dec 16 21:04:43 tesla ollama[8296]: llm_load_print_meta: freq_base_train = 1000000.0
Dec 16 21:04:43 tesla ollama[8296]: llm_load_print_meta: freq_scale_train = 1
Dec 16 21:04:43 tesla ollama[8296]: llm_load_print_meta: n_yarn_orig_ctx = 32768
Dec 16 21:04:43 tesla ollama[8296]: llm_load_print_meta: rope_finetuned = unknown
Dec 16 21:04:43 tesla ollama[8296]: llm_load_print_meta: model type = 7B
Dec 16 21:04:43 tesla ollama[8296]: llm_load_print_meta: model ftype = mostly Q4_0
Dec 16 21:04:43 tesla ollama[8296]: llm_load_print_meta: model params = 46.70 B
Dec 16 21:04:43 tesla ollama[8296]: llm_load_print_meta: model size = 24.62 GiB (4.53 BPW)
Dec 16 21:04:43 tesla ollama[8296]: llm_load_print_meta: general.name = mistralai
Dec 16 21:04:43 tesla ollama[8296]: llm_load_print_meta: BOS token = 1 '<s>'
Dec 16 21:04:43 tesla ollama[8296]: llm_load_print_meta: EOS token = 2 '</s>'
Dec 16 21:04:43 tesla ollama[8296]: llm_load_print_meta: UNK token = 0 '<unk>'
Dec 16 21:04:43 tesla ollama[8296]: llm_load_print_meta: LF token = 13 '<0x0A>'
Dec 16 21:04:43 tesla ollama[8296]: llm_load_tensors: ggml ctx size = 0.39 MiB
Dec 16 21:04:43 tesla ollama[8296]: llm_load_tensors: using CUDA for GPU acceleration
Dec 16 21:04:43 tesla ollama[8296]: llm_load_tensors: mem required = 70.71 MiB
Dec 16 21:04:43 tesla ollama[8296]: llm_load_tensors: offloading 32 repeating layers to GPU
Dec 16 21:04:43 tesla ollama[8296]: llm_load_tensors: offloading non-repeating layers to GPU
Dec 16 21:04:43 tesla ollama[8296]: llm_load_tensors: offloaded 33/33 layers to GPU
Dec 16 21:04:43 tesla ollama[8296]: llm_load_tensors: VRAM used: 25145.55 MiB
Dec 16 21:04:47 tesla ollama[8296]: ..........................................................................................
Dec 16 21:04:47 tesla ollama[8296]: CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:8955: out of memory
Dec 16 21:04:47 tesla ollama[8296]: current device: 0
Dec 16 21:04:47 tesla ollama[8296]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:8955: !"CUDA error"
Dec 16 21:04:48 tesla ollama[8296]: 2023/12/16 21:04:48 llama.go:451: 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:8955: out of>
Dec 16 21:04:48 tesla ollama[8296]: current device: 0
Dec 16 21:04:48 tesla ollama[8296]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:8955: !"CUDA error"
Dec 16 21:04:48 tesla ollama[8296]: 2023/12/16 21:04:48 llama.go:459: error starting llama runner: llama runner process has terminated
Dec 16 21:04:48 tesla ollama[8296]: 2023/12/16 21:04:48 llama.go:525: llama runner stopped successfully
Dec 16 21:04:48 tesla ollama[8296]: 2023/12/16 21:04:48 llama.go:436: starting llama runner
Dec 16 21:04:48 tesla ollama[8296]: 2023/12/16 21:04:48 llama.go:494: waiting for llama runner to start responding
Dec 16 21:04:48 tesla ollama[17744]: {"timestamp":1702760688,"level":"WARNING","function":"server_params_parse","line":2148,"message":"Not compiled with GP>
Dec 16 21:04:48 tesla ollama[17744]: {"timestamp":1702760688,"level":"INFO","function":"main","line":2652,"message":"build info","build":441,"commit":"948f>
Dec 16 21:04:48 tesla ollama[17744]: {"timestamp":1702760688,"level":"INFO","function":"main","line":2655,"message":"system info","n_threads":8,"n_threads_>
Dec 16 21:04:48 tesla ollama[8296]: llama_model_loader: loaded meta data with 24 key-value pairs and 995 tensors from /usr/share/ollama/.ollama/models/blob>
Dec 16 21:04:48 tesla ollama[8296]: llama_model_loader: - tensor 0: token_embd.weight q4_0 [ 4096, 32000, 1, 1 ]
```


@phalexo commented on GitHub (Dec 16, 2023):

I am running on the Maxwell architecture.

```bash
git clone --recursive https://github.com/jmorganca/ollama.git
cd ollama/llm/llama.cpp
vi generate_linux.go
```

```go
//go:generate cmake -S ggml -B ggml/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DLLAMA_CUDA_FORCE_MMQ=on
//go:generate cmake --build ggml/build/cuda --target server --config Release
//go:generate mv ggml/build/cuda/bin/server ggml/build/cuda/bin/ollama-runner
//go:generate cmake -S gguf -B gguf/build/cuda -DLLAMA_CUBLAS=on -DLLAMA_ACCELERATE=on -DLLAMA_K_QUANTS=on -DLLAMA_NATIVE=off -DLLAMA_AVX=on -DLLAMA_AVX2=off -DLLAMA_AVX512=off -DLLAMA_FMA=off -DLLAMA_F16C=off -DLLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 -DLLAMA_CUDA_FORCE_MMQ=on
//go:generate cmake --build gguf/build/cuda --target server --config Release
//go:generate mv gguf/build/cuda/bin/server gguf/build/cuda/bin/ollama-runner
```

```bash
cd ../..
go generate ./...
go build .
```
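After the build, a sketch of swapping in the rebuilt binary (the paths and service name assume the default Linux install of Ollama; adjust for your setup):

```bash
# Sketch, assuming the default Linux install: systemd unit "ollama",
# binary at /usr/local/bin/ollama.
sudo systemctl stop ollama
sudo cp ./ollama /usr/local/bin/ollama
sudo systemctl start ollama
journalctl -u ollama -f   # watch the logs to confirm the runner stays up
```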

@phalexo commented on GitHub (Dec 16, 2023):

Do you have tensor cores on your GPU? I doubt it.
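One way to check (newer nvidia-smi builds expose the compute capability directly; tensor cores only arrived with Volta, compute capability 7.0):

```bash
# The M40 reports 5.2 (Maxwell), which predates tensor cores.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```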


@easp commented on GitHub (Dec 17, 2023):

The capabilities of the M40 are kind of limited, plus it looks like you are using older drivers.


@sunzh231 commented on GitHub (Dec 19, 2023):

The same error occurred on an M40; it ran normally on versions up to and including 0.1.11.


@phalexo commented on GitHub (Dec 19, 2023):

You have to rebuild with LLAMA_CUDA_FORCE_MMQ=on. Performance may be a bit worse, but it should work.



@bonswouar commented on GitHub (Jan 4, 2024):

I can confirm that compiling with `LLAMA_CUDA_FORCE_MMQ=on` solves the issue for me too: I can now run 7B models with 10 to 20 layers on the GPU (an old GTX 970).

Reference: github-starred/ollama#47369