[GH-ISSUE #2587] Running on GPU #48034

Closed
opened 2026-04-28 06:29:52 -05:00 by GiteaMirror · 30 comments

Originally created by @shersoni610 on GitHub (Feb 19, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2587

Originally assigned to: @dhiltgen on GitHub.

Hello,
It seems the response time of llama2:7b is slow on my Linux machine. I am not sure if the code is running on the Nvidia card.

In Python code, how can I ensure that Ollama models run on the GPU?
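There is no Python-side switch for this: Ollama decides CPU vs GPU placement inside the server. Placement can still be verified from Python, though. A minimal sketch, assuming a newer Ollama build than the one in this thread — one that exposes the `GET /api/ps` endpoint, which reports `size_vram` per loaded model (zero means the model is resident on the CPU):

```python
# Sketch: ask the local Ollama server which loaded models are in VRAM.
# Assumes a newer Ollama build that serves GET /api/ps (not in 0.1.27).
import json
from urllib.request import urlopen

with urlopen("http://localhost:11434/api/ps") as resp:
    models = json.load(resp)["models"]

for m in models:
    placement = "GPU" if m.get("size_vram", 0) > 0 else "CPU"
    print(f'{m["name"]}: {m.get("size_vram", 0)}/{m["size"]} bytes in VRAM -> {placement}')
```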

@jaifar530 commented on GitHub (Feb 19, 2024):

Hi,

sudo apt install nvtop

While asking the LLM a question, run `nvtop` and check the GPU utilization percentage.

@shersoni610 commented on GitHub (Feb 20, 2024):

Hello,

Thanks for the info. I see that GPU usage is 0% and CPU is 794%. At least this confirms that the code is running on the CPU. How should I utilize the GPU?

@jaifar530 commented on GitHub (Feb 20, 2024):

First, you need to make sure that these two commands both show valid output:

$ nvidia-smi
$ nvcc --version

If one of them gives no output, you will be shown a suggested CLI command to install it ("sudo apt install ... cuda ..." or "sudo apt install ... nvidia ... driver"). DON'T install it yet; follow the steps below:

  1. Go to the BIOS settings and disable Secure Boot.
  2. Then install the missing driver suggested to you above.
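The same check can be scripted; a small sketch that just runs both commands and reports whether each produces output:

```python
# Sketch: confirm that nvidia-smi and nvcc both exist and produce output.
import shutil
import subprocess

for cmd in (["nvidia-smi"], ["nvcc", "--version"]):
    if shutil.which(cmd[0]) is None:
        print(f"{cmd[0]}: not found on PATH")
        continue
    result = subprocess.run(cmd, capture_output=True, text=True)
    ok = result.returncode == 0 and result.stdout.strip()
    print(f"{cmd[0]}: {'valid output' if ok else 'no valid output'}")
```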

@shersoni610 commented on GitHub (Feb 20, 2024):

Hello,

Both commands are working. I still see high CPU usage and zero GPU usage.

@jaifar530 commented on GitHub (Feb 20, 2024):

> Hello,
>
> Both commands are working. I still see high CPU usage and zero GPU usage.

Do one more thing:

  1. Make sure the ollama prompt is closed. While it is closed, run the `nvtop` command and check the GPU RAM utilization.

  2. Then run `ollama run llama2:7b`.

  3. At the same time as (2), check the GPU RAM utilization: is it the same as before running ollama?

If it is the same, then maybe the GPU does not support CUDA.

If it is not the same and it goes up to 3-6 GB, then everything works fine on your end, and it is the ollama issue that many people have raised with the current version, where the GPU is not supported on higher layers.
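The before/after VRAM comparison in steps (1)-(3) can also be done programmatically. A sketch using the `pynvml` NVML bindings (an assumption: install them with `pip install nvidia-ml-py`); run it once before and once during `ollama run`:

```python
# Sketch: print current VRAM usage of the first GPU so you can compare
# the numbers before and during an `ollama run` session.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"VRAM used: {mem.used / 2**20:.0f} MiB of {mem.total / 2**20:.0f} MiB")
pynvml.nvmlShutdown()
```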

@jaifar530 commented on GitHub (Feb 21, 2024):

Also, try doing a fresh installation (or reinstalling) using the install script; it should show you whether the GPU is detected or not.

![image](https://github.com/ollama/ollama/assets/31308766/59b6709a-3ce9-4b2e-a82a-16120d405635)

@shersoni610 commented on GitHub (Feb 21, 2024):

Thanks. I see the following:

>>> Adding ollama user to render group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
>>> NVIDIA GPU installed.

I still see high CPU usage and zero GPU utilization.

@oodzchen commented on GitHub (Feb 24, 2024):

Same here. I use an RTX 3080 on Linux; the install script shows "NVIDIA GPU installed.", but neither `nvtop` nor `nvidia-smi` output shows any GPU usage when running the models; even the Intel GPU is at zero percent.

@jaifar530 commented on GitHub (Feb 24, 2024):

> Same here. I use an RTX 3080 on Linux; the install script shows "NVIDIA GPU installed.", but neither `nvtop` nor `nvidia-smi` output shows any GPU usage when running the models; even the Intel GPU is at zero percent.

Which LLM model have you used?

@oodzchen commented on GitHub (Feb 24, 2024):

@jaifar530 I've tried llama2, mistral and gemma, all the same.

@jaifar530 commented on GitHub (Feb 24, 2024):

> @jaifar530 I've tried llama2, mistral and gemma, all the same.

Does `nvcc --version` show output?

@oodzchen commented on GitHub (Feb 24, 2024):

> Does `nvcc --version` show output?

I'm using openSUSE Tumbleweed and successfully installed `cuda` and `cuda-toolkit`, but could not find the `nvcc` command. The `nvidia-smi` output shows the CUDA version is 12.3.

@oodzchen commented on GitHub (Feb 24, 2024):

> Does `nvcc --version` show output?

I just found the nvcc binary; the output is:

```shell
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
```

@dhiltgen commented on GitHub (Feb 26, 2024):

Can you share the server log so we can see why it's not able to detect your GPU?

@oodzchen commented on GitHub (Feb 27, 2024):

> Can you share the server log so we can see why it's not able to detect your GPU?

Here's the output of `journalctl -xeu ollama.service`:

```
Feb 26 22:45:01 pc-opss systemd[1]: Started Ollama Service.
░░ Subject: A start job for unit ollama.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ A start job for unit ollama.service has finished successfully.
░░ 
░░ The job identifier is 302.
Feb 26 22:45:01 pc-opss ollama[1248]: time=2024-02-26T22:45:01.556+08:00 level=INFO source=images.go:710 msg="total blobs: 18"
Feb 26 22:45:01 pc-opss ollama[1248]: time=2024-02-26T22:45:01.558+08:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
Feb 26 22:45:01 pc-opss ollama[1248]: time=2024-02-26T22:45:01.559+08:00 level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)"
Feb 26 22:45:01 pc-opss ollama[1248]: time=2024-02-26T22:45:01.560+08:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
Feb 26 22:45:03 pc-opss ollama[1248]: time=2024-02-26T22:45:03.043+08:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [rocm_v6 cpu_avx2 rocm_v5 cpu cpu_avx cuda_v11]"
Feb 26 22:45:03 pc-opss ollama[1248]: time=2024-02-26T22:45:03.043+08:00 level=INFO source=gpu.go:94 msg="Detecting GPU type"
Feb 26 22:45:03 pc-opss ollama[1248]: time=2024-02-26T22:45:03.043+08:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
Feb 26 22:45:03 pc-opss ollama[1248]: time=2024-02-26T22:45:03.059+08:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/usr/lib/libnvidia-ml.so.545.29.06 /usr/lib64/libnvidia-ml.so.545.29.06]"
Feb 26 22:45:03 pc-opss ollama[1248]: time=2024-02-26T22:45:03.059+08:00 level=INFO source=gpu.go:323 msg="Unable to load CUDA management library /usr/lib/libnvidia-ml.so.545.29.06: Unable to load /usr/lib/libnvidia-ml.so.545.29.06 library to query for Nvidia GPUs: /usr/lib/libnvidia-ml.so.545.29.06: wrong ELF class: ELFCLASS32"
Feb 26 22:45:03 pc-opss ollama[1248]: time=2024-02-26T22:45:03.065+08:00 level=INFO source=gpu.go:323 msg="Unable to load CUDA management library /usr/lib64/libnvidia-ml.so.545.29.06: nvml vram init failure: 4"
Feb 26 22:45:03 pc-opss ollama[1248]: time=2024-02-26T22:45:03.065+08:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library librocm_smi64.so"
Feb 26 22:45:03 pc-opss ollama[1248]: time=2024-02-26T22:45:03.065+08:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
Feb 26 22:45:03 pc-opss ollama[1248]: time=2024-02-26T22:45:03.065+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Feb 26 22:45:03 pc-opss ollama[1248]: time=2024-02-26T22:45:03.065+08:00 level=INFO source=routes.go:1042 msg="no GPU detected"
Feb 27 10:14:44 pc-opss ollama[1248]: [GIN] 2024/02/27 - 10:14:44 | 200 |      28.694µs |       127.0.0.1 | HEAD     "/"
Feb 27 10:14:44 pc-opss ollama[1248]: [GIN] 2024/02/27 - 10:14:44 | 200 |    1.607655ms |       127.0.0.1 | POST     "/api/show"
Feb 27 10:14:44 pc-opss ollama[1248]: [GIN] 2024/02/27 - 10:14:44 | 200 |    1.136526ms |       127.0.0.1 | POST     "/api/show"
Feb 27 10:14:44 pc-opss ollama[1248]: time=2024-02-27T10:14:44.974+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Feb 27 10:14:44 pc-opss ollama[1248]: time=2024-02-27T10:14:44.974+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
Feb 27 10:14:44 pc-opss ollama[1248]: time=2024-02-27T10:14:44.974+08:00 level=INFO source=llm.go:77 msg="GPU not available, falling back to CPU"
Feb 27 10:14:44 pc-opss ollama[1248]: time=2024-02-27T10:14:44.974+08:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama1034691504/cpu_avx2/libext_server.so"
Feb 27 10:14:44 pc-opss ollama[1248]: time=2024-02-27T10:14:44.974+08:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
Feb 27 10:14:44 pc-opss ollama[1248]: llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from /usr/share/ollama/.ollama/models/blobs/sha256:8934d96d3f08982e95922b2b7a2c626a1fe873d7c3b06e8e56d7bc0a1fef9246 (version GGUF V3 (latest))
Feb 27 10:14:44 pc-opss ollama[1248]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Feb 27 10:14:44 pc-opss ollama[1248]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Feb 27 10:14:44 pc-opss ollama[1248]: llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
Feb 27 10:14:44 pc-opss ollama[1248]: llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
Feb 27 10:14:44 pc-opss ollama[1248]: llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
Feb 27 10:14:44 pc-opss ollama[1248]: llama_model_loader: - kv   4:                          llama.block_count u32              = 32
Feb 27 10:14:44 pc-opss ollama[1248]: llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
Feb 27 10:14:44 pc-opss ollama[1248]: llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
Feb 27 10:14:44 pc-opss ollama[1248]: llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
Feb 27 10:14:44 pc-opss ollama[1248]: llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
Feb 27 10:14:44 pc-opss ollama[1248]: llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Feb 27 10:14:44 pc-opss ollama[1248]: llama_model_loader: - kv  10:                          general.file_type u32              = 2
Feb 27 10:14:44 pc-opss ollama[1248]: llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
Feb 27 10:14:44 pc-opss ollama[1248]: llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
Feb 27 10:14:44 pc-opss ollama[1248]: llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
Feb 27 10:14:44 pc-opss ollama[1248]: llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
Feb 27 10:14:45 pc-opss ollama[1248]: llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,61249]   = ["▁ t", "e r", "i n", "▁ a", "e n...
Feb 27 10:14:45 pc-opss ollama[1248]: llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
Feb 27 10:14:45 pc-opss ollama[1248]: llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
Feb 27 10:14:45 pc-opss ollama[1248]: llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
Feb 27 10:14:45 pc-opss ollama[1248]: llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
Feb 27 10:14:45 pc-opss ollama[1248]: llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
Feb 27 10:14:45 pc-opss ollama[1248]: llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
Feb 27 10:14:45 pc-opss ollama[1248]: llama_model_loader: - kv  22:               general.quantization_version u32              = 2
Feb 27 10:14:45 pc-opss ollama[1248]: llama_model_loader: - type  f32:   65 tensors
Feb 27 10:14:45 pc-opss ollama[1248]: llama_model_loader: - type q4_0:  225 tensors
Feb 27 10:14:45 pc-opss ollama[1248]: llama_model_loader: - type q6_K:    1 tensors
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_vocab: special tokens definition check successful ( 259/32000 ).
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: format           = GGUF V3 (latest)
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: arch             = llama
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: vocab type       = SPM
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: n_vocab          = 32000
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: n_merges         = 0
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: n_ctx_train      = 4096
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: n_embd           = 4096
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: n_head           = 32
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: n_head_kv        = 32
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: n_layer          = 32
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: n_rot            = 128
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: n_embd_head_k    = 128
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: n_embd_head_v    = 128
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: n_gqa            = 1
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: n_embd_k_gqa     = 4096
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: n_embd_v_gqa     = 4096
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: n_ff             = 11008
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: n_expert         = 0
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: n_expert_used    = 0
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: rope scaling     = linear
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: freq_base_train  = 10000.0
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: freq_scale_train = 1
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: n_yarn_orig_ctx  = 4096
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: rope_finetuned   = unknown
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: model type       = 7B
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: model ftype      = Q4_0
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: model params     = 6.74 B
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: general.name     = LLaMA v2
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: BOS token        = 1 '<s>'
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: EOS token        = 2 '</s>'
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: UNK token        = 0 '<unk>'
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_print_meta: LF token         = 13 '<0x0A>'
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_tensors: ggml ctx size =    0.11 MiB
Feb 27 10:14:45 pc-opss ollama[1248]: llm_load_tensors:        CPU buffer size =  3647.87 MiB
Feb 27 10:14:45 pc-opss ollama[1248]: ..................................................................................................
Feb 27 10:14:45 pc-opss ollama[1248]: llama_new_context_with_model: n_ctx      = 2048
Feb 27 10:14:45 pc-opss ollama[1248]: llama_new_context_with_model: freq_base  = 10000.0
Feb 27 10:14:45 pc-opss ollama[1248]: llama_new_context_with_model: freq_scale = 1
Feb 27 10:14:45 pc-opss ollama[1248]: llama_kv_cache_init:        CPU KV buffer size =  1024.00 MiB
Feb 27 10:14:45 pc-opss ollama[1248]: llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
Feb 27 10:14:45 pc-opss ollama[1248]: llama_new_context_with_model:        CPU input buffer size   =    13.02 MiB
Feb 27 10:14:45 pc-opss ollama[1248]: llama_new_context_with_model:        CPU compute buffer size =   160.00 MiB
Feb 27 10:14:45 pc-opss ollama[1248]: llama_new_context_with_model: graph splits (measure): 1
Feb 27 10:14:46 pc-opss ollama[1248]: time=2024-02-27T10:14:46.008+08:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"
Feb 27 10:14:46 pc-opss ollama[1248]: [GIN] 2024/02/27 - 10:14:46 | 200 |  1.156281055s |       127.0.0.1 | POST     "/api/chat"
Feb 27 10:19:42 pc-opss ollama[1248]: [GIN] 2024/02/27 - 10:19:42 | 200 |      12.678µs |       127.0.0.1 | HEAD     "/"
Feb 27 10:19:42 pc-opss ollama[1248]: [GIN] 2024/02/27 - 10:19:42 | 200 |      298.38µs |       127.0.0.1 | POST     "/api/show"
Feb 27 10:19:51 pc-opss ollama[1248]: [GIN] 2024/02/27 - 10:19:51 | 200 |  9.558827934s |       127.0.0.1 | POST     "/api/generate"
Feb 27 10:22:13 pc-opss ollama[1248]: [GIN] 2024/02/27 - 10:22:13 | 200 |      12.532µs |       127.0.0.1 | HEAD     "/"
Feb 27 10:22:13 pc-opss ollama[1248]: [GIN] 2024/02/27 - 10:22:13 | 200 |     202.579µs |       127.0.0.1 | POST     "/api/show"
Feb 27 10:22:13 pc-opss ollama[1248]: [GIN] 2024/02/27 - 10:22:13 | 200 |      232.93µs |       127.0.0.1 | POST     "/api/show"
Feb 27 10:22:13 pc-opss ollama[1248]: [GIN] 2024/02/27 - 10:22:13 | 200 |      96.649µs |       127.0.0.1 | POST     "/api/chat"
Feb 27 10:23:00 pc-opss ollama[1248]: [GIN] 2024/02/27 - 10:23:00 | 200 | 30.669833967s |       127.0.0.1 | POST     "/api/chat"
```

@alienatorZ commented on GitHub (Feb 27, 2024):

I am having the same issue. I'm running Ubuntu Server 22.04 with two Nvidia Tesla P40s. I run llama.cpp on the GPUs no problem. Ollama detected the Nvidia GPU during installation but still runs on CPU.

@alienatorZ commented on GitHub (Feb 27, 2024):

My systemd log looks suspect:
Feb 27 13:21:19 llmsrv ollama[285412]: time=2024-02-27T13:21:19.370Z level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)"
Feb 27 13:21:19 llmsrv ollama[285412]: time=2024-02-27T13:21:19.371Z level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
Feb 27 13:21:21 llmsrv ollama[285412]: time=2024-02-27T13:21:21.698Z level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [rocm_v5 cpu_avx cpu cpu_avx2 rocm_v6 cuda_v11>
Feb 27 13:21:21 llmsrv ollama[285412]: time=2024-02-27T13:21:21.698Z level=INFO source=gpu.go:94 msg="Detecting GPU type"
Feb 27 13:21:21 llmsrv ollama[285412]: time=2024-02-27T13:21:21.698Z level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
Feb 27 13:21:21 llmsrv ollama[285412]: time=2024-02-27T13:21:21.701Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.154.05>
Feb 27 13:21:22 llmsrv ollama[285412]: time=2024-02-27T13:21:22.159Z level=INFO source=gpu.go:99 msg="Nvidia GPU detected"
Feb 27 13:21:22 llmsrv ollama[285412]: time=2024-02-27T13:21:22.159Z level=INFO source=cpu_common.go:18 msg="CPU does not have vector extensions"
Feb 27 13:21:22 llmsrv ollama[285412]: time=2024-02-27T13:21:22.159Z level=WARN source=gpu.go:128 msg="CPU does not have AVX or AVX2, disabling GPU support."
Feb 27 13:21:22 llmsrv ollama[285412]: time=2024-02-27T13:21:22.159Z level=INFO source=routes.go:1042 msg="no GPU detected"

Why does it detect the GPU but then disable GPU support because of missing AVX?
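One way to see what the host (or, with Proxmox, the VM) actually exposes is to read the CPU flags directly; a sketch for Linux:

```python
# Sketch: check whether the CPU visible to this machine/VM advertises AVX/AVX2.
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break

print("AVX: ", "avx" in flags)
print("AVX2:", "avx2" in flags)
```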

@jaifar530 commented on GitHub (Feb 27, 2024):

> I am having the same issue. I'm running Ubuntu Server 22.04 with two Nvidia Tesla P40s. I run llama.cpp on the GPUs no problem. Ollama detected the Nvidia GPU during installation but still runs on CPU.

Can you try a small LLM (e.g. a 2B model), and at the same time run `nvtop` to see if the GPU is utilized?

@alienatorZ commented on GitHub (Feb 27, 2024):

Using Phi 2.7B, it is still maxing the CPU and not using the GPU.

@dhiltgen commented on GitHub (Feb 27, 2024):

@oodzchen thanks for the logs.

```
Feb 26 22:45:03 pc-opss ollama[1248]: time=2024-02-26T22:45:03.065+08:00 level=INFO source=gpu.go:323 msg="Unable to load CUDA management library /usr/lib64/libnvidia-ml.so.545.29.06: nvml vram init failure: 4"
```

This is the root cause of Ollama not being able to discover the GPU. I believe this maps to `NVML_ERROR_NO_PERMISSION = 4`: "The current user does not have permission for operation."

As a quick test, you can try shutting down the system service (which runs as user `ollama`) and running it as root to confirm it properly detects the GPU.

What Linux distro are you using?
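The same NVML initialization can be reproduced outside Ollama to compare users. A sketch with the `pynvml` bindings (an assumption: `pip install nvidia-ml-py`); run it as your own user, as root, and as the `ollama` user (e.g. `sudo -u ollama python3 nvml_check.py`):

```python
# Sketch: initialize NVML as the current user; a permission problem surfaces
# here as an NVMLError, mirroring the "nvml vram init failure: 4" log line.
import pynvml

try:
    pynvml.nvmlInit()
    print("NVML init OK; GPUs visible:", pynvml.nvmlDeviceGetCount())
    pynvml.nvmlShutdown()
except pynvml.NVMLError as err:
    print("NVML init failed:", err)
```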

@dhiltgen commented on GitHub (Feb 27, 2024):

@shersoni610 can you provide your server log?

@dhiltgen commented on GitHub (Feb 27, 2024):

@alienatorZ you hit issue #2187 - at present, we require AVX support for GPUs to be activated.

@alienatorZ commented on GitHub (Feb 28, 2024):

@dhiltgen Ahhhh I see now. Yes Proxmox was not enabling AVX! Thank you.

@oodzchen commented on GitHub (Feb 28, 2024):

@dhiltgen I've edited `/etc/systemd/system/ollama.service`, changed `Group=` and `User=` to root, reloaded the daemons, and restarted the ollama service; now the GPU is used. Thanks very much.

The distro I'm using is openSUSE Tumbleweed.

@dhiltgen commented on GitHub (Feb 28, 2024):

@oodzchen that's great to hear. We would like to set things up so Ollama can run deprivileged rather than as root. There's probably a minor fix needed in our install script to wire things up properly. We add the ollama user to the `render` group, but that must not be correct on openSUSE. Are you able to tell what group we should add on your system? Presumably it's some group your user account was added to so you can run tools like `nvidia-smi`.

If you want to experiment, try adding the ollama user to the plausible groups until it works. See https://github.com/ollama/ollama/blob/main/scripts/install.sh#L87-L90 for reference.
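Rather than guessing, the owning group of the NVIDIA device nodes can be read directly; a sketch (the device paths are typical, not guaranteed):

```python
# Sketch: report which group owns the NVIDIA/DRI device nodes; that group is
# the one the ollama user most likely needs to join.
import glob
import grp
import os

for path in sorted(glob.glob("/dev/nvidia*") + glob.glob("/dev/dri/*")):
    st = os.stat(path)
    group = grp.getgrgid(st.st_gid).gr_name
    print(f"{path}: group={group}, mode={oct(st.st_mode & 0o777)}")
```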

@oodzchen commented on GitHub (Feb 29, 2024):

@dhiltgen I've manually reset the ollama service file and tried adding the ollama user to different groups, reloading the service each time; it turns out `video` is the right group.

@xiaotianfotos commented on GitHub (Feb 29, 2024):

I'm using Dify to connect to the Ollama service. When using the Ollama API, the model always loads into CPU memory, but when using the OpenAI-compatible API (via Ollama), it loads onto the GPU.

![111111](https://github.com/ollama/ollama/assets/25025807/e3c6e59d-030f-4e1b-bca0-d448cf2830d4)

@dhiltgen commented on GitHub (Feb 29, 2024):

@shersoni610 since you opened the issue, I'd like to confirm the fix I'm about to merge for @oodzchen will fix your problem as well before I close this issue.

@dhiltgen commented on GitHub (Feb 29, 2024):

@xiaotianfotos can you open a new issue with logs attached showing the two runs?

@raffaelemancuso commented on GitHub (Jun 7, 2025):

`nvtop` seems to be Linux-only. If you're on Windows, use `nvitop`: `uvx nvitop`.

Reference: github-starred/ollama#48034