[GH-ISSUE #1979] Unable to get Ollama to utilize GPU on Jetson Orin Nano 8Gb #63180

Closed
opened 2026-05-03 12:24:28 -05:00 by GiteaMirror · 81 comments
Owner

Originally created by @remy415 on GitHub (Jan 13, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1979

Originally assigned to: @dhiltgen on GitHub.

I've reviewed the great tutorial made by @bnodnarb here:
https://github.com/jmorganca/ollama/blob/main/docs/tutorials/nvidia-jetson.md

The Orin Nano is running Ubuntu 20.04 with Jetpack 5.1.2 (L4T r35.4.1). The container is also running L4T version 35.4.1. Jetpack 5.1.2 ships with CUDA 11.4 installed, with compatibility support for CUDA 11.8.

I also followed along with the other 3 Jetson-related issues and have not found a fix.

I have also run ollama serve:

  • with and without tmux
  • with and without tmux and LD_LIBRARY_PATH='/usr/local/cuda/lib64'
  • using the dustynv/stable-diffusion-webui:r35.4.1 container, with ollama installed and environment variables set
    • Note: this container provides accelerated processing for stable-diffusion-webui as-is

In each situation I used the generated 'mistral-jetson' model, and I get similar output each time:

2024/01/13 20:14:02 images.go:808: total blobs: 7
2024/01/13 20:14:02 images.go:815: total unused blobs removed: 0
2024/01/13 20:14:02 routes.go:930: Listening on 127.0.0.1:11434 (version 0.1.20)
2024/01/13 20:14:03 shim_ext_server.go:142: Dynamic LLM variants [cuda]
2024/01/13 20:14:03 gpu.go:88: Detecting GPU type
2024/01/13 20:14:03 gpu.go:203: Searching for GPU management library libnvidia-ml.so
2024/01/13 20:14:03 gpu.go:248: Discovered GPU libraries: []
2024/01/13 20:14:03 gpu.go:203: Searching for GPU management library librocm_smi64.so
2024/01/13 20:14:03 gpu.go:248: Discovered GPU libraries: []
2024/01/13 20:14:03 routes.go:953: no GPU detected
[GIN] 2024/01/13 - 20:14:28 | 200 |      73.666µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/01/13 - 20:14:28 | 200 |    1.154281ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/01/13 - 20:14:28 | 200 |     644.279µs |       127.0.0.1 | POST     "/api/show"
2024/01/13 20:14:28 llm.go:71: GPU not available, falling back to CPU
2024/01/13 20:14:28 ext_server_common.go:136: Initializing internal llama server
(... llama_model_loading)
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = mistralai
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: mem required  = 3917.98 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 159.19 MiB
2024/01/13 20:14:31 ext_server_common.go:144: Starting internal llama main loop
[GIN] 2024/01/13 - 20:14:31 | 200 |  3.017526003s |       127.0.0.1 | POST     "/api/generate"
2024/01/13 20:14:48 ext_server_common.go:158: loaded 0 images
[GIN] 2024/01/13 - 20:15:04 | 200 | 16.039682856s |       127.0.0.1 | POST     "/api/generate"

Key outputs are:
2024/01/13 20:14:03 routes.go:953: no GPU detected
llm_load_tensors: mem required = 3917.98 MiB

Again, I'd just like to note that the stable-diffusion-webui application works with the GPU, as does the referenced Docker container from dustynv. Any suggestions of things to check?

Update: I forgot to mention that I verified CPU and GPU activity using jtop in another terminal. Edited for formatting. Edited to add OS & Jetson versions. Edited to add CUDA version.

@Q-point commented on GitHub (Jan 20, 2024):

@remy415 any solution? I am observing the same on an AGX Orin. Seems like a bug in Ollama.
I verified with jtop also; Ollama only runs on the ARM CPUs.

~$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ollama serve
2024/01/19 21:25:40 images.go:808: total blobs: 8
2024/01/19 21:25:40 images.go:815: total unused blobs removed: 0
2024/01/19 21:25:40 routes.go:930: Listening on 127.0.0.1:11434 (version 0.1.20)
2024/01/19 21:25:41 shim_ext_server.go:142: Dynamic LLM variants [cuda]
2024/01/19 21:25:41 gpu.go:88: Detecting GPU type
2024/01/19 21:25:41 gpu.go:203: Searching for GPU management library libnvidia-ml.so
2024/01/19 21:25:41 gpu.go:248: Discovered GPU libraries: []
2024/01/19 21:25:41 gpu.go:203: Searching for GPU management library librocm_smi64.so
2024/01/19 21:25:41 gpu.go:248: Discovered GPU libraries: []
2024/01/19 21:25:41 routes.go:953: no GPU detected

@remy415 commented on GitHub (Jan 20, 2024):

@Q-point nothing yet. I haven't had time to troubleshoot. If I had to guess, I would say there may have been an update in the way Jetpack presents its drivers. I'm not an expert in Linux drivers; it's just the only thing that makes sense, given that @bnodnarb was able to get it working with little customization, and I doubt Ollama made any tweaks if it was already working, so the only logical culprit is a change in the drivers.

I do know that NVidia made a change in the way it exposes CUDA to containers. Previously, containers would basically mount the installed host drivers inside the container. Now, the containers released by dustynv have the drivers baked into the containers, and NVidia's expectation is the decoupling of host-system drivers from container-used drivers.

@dhiltgen commented on GitHub (Jan 26, 2024):

Is there any way to get libnvidia-ml.so installed on the system? What does the nvidia-smi output look like on these systems?

@remy415 commented on GitHub (Jan 26, 2024):

I've tried to get it installed, but as dustynv pointed out in another post somewhere (on my phone, will find it later), the Tegra line of SBCs running Jetpack are integrated GPUs and aren't compatible with nvml/libnvidia-ml.so/nvidia-smi. This is changing in Jetpack 6.0, but that isn't releasing until at least March.

I spent some time poking around in the ollama source code to see what exactly it needed from libnvidia-ml.so, but I was having difficulty finding comparable syscalls on the Jetson because the system data tools I did find are just Python scripts that call Python CUDA libraries; I didn't dive too far down that rabbit hole.

Another thing is that the llama_cpp that works for the Jetson is a custom build done by dustynv that leverages the nvcc compiler. I tried injecting his llama_cpp container and prebuilt binary into the ollama Dockerfile build, but it didn't work; I think there is something gpu_info passes to the make process, but I haven't worked the kinks out of that yet, and I still need to find what information the gpu_info.go routine requires from the CUDA API to ensure it's properly converted for the Jetson.

Any insights you could provide on that front would be greatly appreciated.

@dhiltgen commented on GitHub (Jan 26, 2024):

That's unfortunate that they didn't implement support for the management library. We've added a dependency on it to discover the available GPUs and their memory information, so we can determine how much we can load into the GPU. We do have a mechanism now to force a specific llm library with OLLAMA_LLM_LIBRARY (see https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#llm-libraries for usage); however, this doesn't currently play nicely with the memory prediction logic for GPUs - it's largely for forcing CPU mode at this point.

A potential path for us to consider here is to refine the memory prediction logic so you can tell us how much memory to use via an env var and bypass the management library checks, and then force the cuda llm library; that might be sufficient to get us working again on these systems.

@remy415 commented on GitHub (Jan 27, 2024):

There's support for the library, kind of. I've compiled my own .so file that essentially wraps the NVML function parameters returned in the API calls. I'm not terribly great with C/C++, but I got it to compile with NVCC. TL;DR: I used cuda_runtime.h API calls to gather the same information returned by the NVML calls. It seems to have mostly worked, but I'm running into errors. I'll take another try at it tomorrow; meanwhile, I've uploaded the file changes if you're interested in taking a look: https://www.github.com/remy415/ollama_tegra_fix

The error I received:

2024/01/27 10:26:42 gpu.go:96: INFO Detecting GPU type
2024/01/27 10:26:42 gpu.go:219: INFO Searching for GPU management library libtegra-ml.so
2024/01/27 10:26:42 gpu.go:265: INFO Discovered GPU libraries: [/usr/local/cuda/lib64/libtegra-ml.so]
2024/01/27 10:26:42 gpu.go:102: INFO NVidia Jetson Device GPU detected
2024/01/27 10:26:42 gpu.go:136: INFO error looking up CUDA GPU memory: device memory info lookup failure 0: 1
2024/01/27 10:26:42 cpu_common.go:18: INFO CPU does not have vector extensions
2024/01/27 10:26:42 routes.go:966: INFO no GPU detected

Seems like the GetGpuInfo() function (the memInfo call) failed. I'll need to take a look and make sure I implemented the API correctly, and that the information is passed in the format the Go routine expects. I haven't had much time to troubleshoot today; this is just an initial draft of what I was thinking could be done to avoid leaving too much in the user's hands, assuming API usage is preferred over user-defined env variables for memory, etc.

@dhiltgen commented on GitHub (Jan 27, 2024):

@remy415 here's the set of APIs we call in the library: https://github.com/ollama/ollama/blob/main/gpu/gpu_info_cuda.c#L18-L29. The critical ones are nvmlInit_v2, nvmlDeviceGetCount_v2, nvmlDeviceGetHandleByIndex, and nvmlDeviceGetMemoryInfo. The rest can be stubbed to return "not supported" style errors and we should gracefully handle them.
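
For illustration, a minimal sketch of such a shim is below. It is not remy415's actual file or ollama's code; the file name, build command, and error-code values are assumptions. Only the four critical NVML entry points are mapped onto documented CUDA runtime calls; everything else is stubbed as "not supported":

/* tegra_ml_shim.c - hypothetical sketch of an NVML-compatible shim backed by
 * the CUDA runtime API. Build (assumed):
 *   nvcc --compiler-options '-fPIC' -shared -o libtegra-ml.so tegra_ml_shim.c
 */
#include <cuda_runtime.h>

typedef int nvmlReturn_t;                 /* 0 == NVML_SUCCESS */
#define NVML_SUCCESS             0
#define NVML_ERROR_NOT_SUPPORTED 3
#define NVML_ERROR_UNKNOWN       999

struct shim_dev { unsigned int idx; };
typedef struct shim_dev *nvmlDevice_t;
typedef struct { unsigned long long total, free, used; } nvmlMemory_t;

nvmlReturn_t nvmlInit_v2(void) {
    int n;
    /* cudaGetDeviceCount doubles as an init and sanity check */
    return cudaGetDeviceCount(&n) == cudaSuccess ? NVML_SUCCESS : NVML_ERROR_UNKNOWN;
}

nvmlReturn_t nvmlShutdown(void) { return NVML_SUCCESS; }

nvmlReturn_t nvmlDeviceGetCount_v2(unsigned int *count) {
    int n;
    if (cudaGetDeviceCount(&n) != cudaSuccess) return NVML_ERROR_UNKNOWN;
    *count = (unsigned int)n;
    return NVML_SUCCESS;
}

/* On a single-GPU Jetson the "handle" can simply carry the device index. */
nvmlReturn_t nvmlDeviceGetHandleByIndex(unsigned int index, nvmlDevice_t *dev) {
    static struct shim_dev handle;        /* fine for a sketch, not thread-safe */
    handle.idx = index;
    *dev = &handle;
    return NVML_SUCCESS;
}

nvmlReturn_t nvmlDeviceGetMemoryInfo(nvmlDevice_t dev, nvmlMemory_t *mem) {
    size_t free_b, total_b;
    if (cudaSetDevice((int)dev->idx) != cudaSuccess) return NVML_ERROR_UNKNOWN;
    if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess) return NVML_ERROR_UNKNOWN;
    mem->total = total_b;
    mem->free  = free_b;
    mem->used  = total_b - free_b;        /* "used" derived as total minus free */
    return NVML_SUCCESS;
}

/* Anything else the caller looks up is stubbed as "not supported". */
nvmlReturn_t nvmlSystemGetDriverVersion(char *buf, unsigned int len) {
    (void)buf; (void)len;
    return NVML_ERROR_NOT_SUPPORTED;
}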

@remy415 commented on GitHub (Jan 27, 2024):

@dhiltgen Thank you, I'll take a look at that. I'll update the repo and let you know here if I get anything working.

@remy415 commented on GitHub (Jan 28, 2024):

@dhiltgen I was able to use the CUDA Runtime API, compiled with NVCC, to handle initialization and device count. "gethandlebyindex" was just assumed to be '0': there's no Runtime API call for it, but it's also not needed, since you can simply query device properties for a device index (this can be done in a for loop using the device count), along with the memory info (memory max and memory used, with a diff to figure out the free memory) and the CUDA compute capability major and minor values. This will work on any CUDA device and doesn't require hooking into the NVML shared object. It only needs to be compiled with NVCC, which is included in the standard CUDA toolkit. The source code for the sample binary I made should work on any Linux machine with CUDA drivers and a CUDA device.

Any reason you wouldn't want to switch to using the CUDA Runtime API instead of querying NVML? I'm not super experienced with the quirks of the Runtime API versus the NVML API, but if all you're doing is gathering device info before loading llama_cpp, then leveraging the Runtime API instead of NVML would work great as a solution that is compatible with both Jetson and desktop CUDA. I think there are device properties to cover most of what the rest of your typedefs were looking for, too.

If you have a Linux box with a CUDA device (I'll test it out on Windows at some point too, but I think it should work there, as the device API calls seem to be system agnostic), please let me know if this would work as a suitable alternative to libnvidia-ml.so.

Files:

https://raw.githubusercontent.com/remy415/ollama_tegra_fix/master/tegra-ml-test.cu
https://raw.githubusercontent.com/remy415/ollama_tegra_fix/master/gpu_info_tegra.h

Compile command is
nvcc --ptxas-options=-v --compiler-options '-fPIC' -o tegra-ml-test.out tegra-ml-test.cu

Command output:

$ ./tegra-ml-test.out 
Device Number: 0 || nvmlInit_v2() and nvmlShutDown() good.
  Memory info:
    Total: 7834435584 (7471 MB)
    Used:  2238849024 (2135 MB)
    Free:  5595586560 (5336 MB)

  Device Count: 1
  CUDA Compute Capability: 8.7
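
The linked tegra-ml-test.cu source isn't reproduced in this thread; as a rough, hypothetical approximation of such a probe using only documented CUDA runtime calls (file and program names are made up):

/* cuda_probe.c - hypothetical sketch of a CUDA runtime device probe.
 * Build (assumed): nvcc -o cuda_probe cuda_probe.c */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no usable CUDA device/driver found\n");
        return 1;
    }
    printf("Device Count: %d\n", count);

    for (int i = 0; i < count; ++i) {
        size_t free_b = 0, total_b = 0;
        int major = 0, minor = 0;
        cudaSetDevice(i);
        cudaMemGetInfo(&free_b, &total_b);  /* bytes free/total on device i */
        cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, i);
        cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, i);
        printf("Device %d: CC %d.%d, total %zu MB, used %zu MB, free %zu MB\n",
               i, major, minor, total_b >> 20, (total_b - free_b) >> 20,
               free_b >> 20);
    }
    return 0;
}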

@dhiltgen commented on GitHub (Jan 28, 2024):

This sort of approach could be viable. A key aspect of our GPU discovery logic is relying on dlopen/dlsym (and LoadLibrary/GetProcAddress on Windows) so that we can have a soft dependency on the underlying GPU libraries. This lets us fail gracefully at runtime and try multiple options before ultimately falling back to CPU mode if necessary. I believe this would translate into loading libcudart.so and wiring up these cuda* routines.
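
A minimal sketch of that soft-dependency pattern (illustrative only, not ollama's actual gpu_info code; the soname candidate list is an assumption):

/* soft_cudart.c - hypothetical sketch of a soft dependency on libcudart via
 * dlopen/dlsym: probe at runtime and fall back gracefully if anything is
 * missing. Build (assumed): cc soft_cudart.c -ldl */
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

typedef int (*cuda_get_device_count_t)(int *);
typedef int (*cuda_mem_get_info_t)(size_t *, size_t *);

int main(void) {
    /* Try a few likely sonames before giving up. */
    const char *candidates[] = { "libcudart.so", "libcudart.so.12", "libcudart.so.11.0" };
    void *h = NULL;
    for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]) && !h; ++i)
        h = dlopen(candidates[i], RTLD_LAZY);
    if (!h) {
        printf("no CUDA runtime found, falling back to CPU mode\n");
        return 0;
    }

    cuda_get_device_count_t get_count =
        (cuda_get_device_count_t)dlsym(h, "cudaGetDeviceCount");
    cuda_mem_get_info_t mem_info =
        (cuda_mem_get_info_t)dlsym(h, "cudaMemGetInfo");
    if (!get_count || !mem_info) {
        printf("runtime found but symbols missing, falling back to CPU mode\n");
        dlclose(h);
        return 0;
    }

    int n = 0;
    size_t free_b = 0, total_b = 0;
    if (get_count(&n) == 0 /* cudaSuccess */ && n > 0 &&
        mem_info(&free_b, &total_b) == 0) {
        printf("%d CUDA device(s), %zu MB free of %zu MB\n",
               n, free_b >> 20, total_b >> 20);
    } else {
        printf("no usable CUDA device, falling back to CPU mode\n");
    }
    dlclose(h);
    return 0;
}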

@remy415 commented on GitHub (Jan 28, 2024):

I'll try and whip something up as gpu_info_tegra.c & gpu_info_tegra.h with the same structure as gpu_info_cuda.c/etc. I can see the benefits of keeping your code as loosely coupled to CUDA as possible.

Correct me if I'm wrong, but does ollama compile llama_cpp into it when you build it? I didn't see anything in the manual installation instructions. If so, Jetson doesn't work with the default compilations of llama_cpp and requires a build using a syntax I've copied from dustynv's llama_cpp container build. Where is the proper place to inject llama_cpp build flags?

@dhiltgen commented on GitHub (Jan 28, 2024):

We use go generate ./... as a mechanism to perform the native compilation of llama.cpp, and we leverage the cmake build rigging from the upstream repo (with some minor modifications for our use: https://github.com/ollama/ollama/tree/main/llm/ext_server). Check out https://github.com/ollama/ollama/tree/main/llm/generate and, in particular, gen_linux.sh and gen_common.sh. Build flags are somewhat tunable, but Jetson may require additional refinement of those scripts.

@remy415 commented on GitHub (Jan 28, 2024):

I saw something in the comments about all your CUDA builds requiring/using AVX. Tegras ship with ARM64 CPUs that don't have AVX extensions, so if the logic automatically disables GPU support when AVX isn't present (as per the comments), then the GPU library loading will be skipped every time.

@dhiltgen commented on GitHub (Jan 28, 2024):

Good catch! Yes, that recently introduced logic needs to be x86 only. I'll get a PR up for that.
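
For illustration only, a sketch of the sort of arch gating implied here (hypothetical C; the actual check lives in ollama's Go code, per the cpu_common.go log lines above):

/* avx_gate.c - hypothetical sketch: only let a missing-AVX check veto GPU
 * support on x86; on ARM64 (Jetson) AVX does not exist, so skip the check. */
#include <stdio.h>

static int cpu_ok_for_gpu_load(void) {
#if defined(__x86_64__) || defined(__i386__)
    /* GCC/Clang builtin, x86 only: detect AVX at runtime. */
    return __builtin_cpu_supports("avx");
#else
    /* Non-x86 (e.g. aarch64 Tegra): don't disable GPU over a missing AVX. */
    return 1;
#endif
}

int main(void) {
    printf(cpu_ok_for_gpu_load()
               ? "CPU check passed, GPU libraries may be loaded\n"
               : "CPU lacks AVX, skipping GPU libraries (x86 policy)\n");
    return 0;
}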

@bnodnarb commented on GitHub (Jan 29, 2024):

Hi @remy415 and @dhiltgen - thanks for all the effort you both are putting into this. I can confirm that the original tutorial no longer works with Jetpack 5.1.2. I will try on the Jetpack 6 Developer Preview.

Thanks all!

@remy415 commented on GitHub (Jan 29, 2024):

Okay, so I wrote a .c file for Tegra devices, edited gpu.go to accommodate it, and made a few tweaks here and there. Long story short: I recompiled llama_cpp and ollama, set a couple of env variables, and got it to run.

https://github.com/remy415/ollama_tegra_fix.git

@dhiltgen I don't know how you want to approach incorporating this into Ollama for Jetson users, whether you want to incorporate it into the main branch or offer it as a patch or something.

@bnodnarb Try the patch at my GitHub repo and see if it works for you; it worked on my 8GB Orin Nano.

@remy415 commented on GitHub (Jan 30, 2024):

@Q-point @bnodnarb Submitted a PR, should fix the Jetson issues.

@dhiltgen Not sure if you're tracking this or not :)

@Q-point commented on GitHub (Feb 3, 2024):

@remy415 Are you running this within Docker? Even with the instructions above, I still get compile errors when issuing go generate ./...:


/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(630): warning: function "warp_reduce_sum(half2)" was declared but never referenced

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(651): warning: function "warp_reduce_max(half2)" was declared but never referenced

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(1038): error: identifier "__hsub2" is undefined

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(1039): error: identifier "__hmul2" is undefined

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(1030): warning: variable "d" was declared but never referenced

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(1058): error: identifier "__hmul2" is undefined

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(1059): error: identifier "__hadd2" is undefined

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(1081): error: identifier "__hsub2" is undefined

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(1082): error: identifier "__hmul2" is undefined

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(1069): warning: variable "d" was declared but never referenced

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(1105): error: identifier "__hmul2" is undefined

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(1106): error: identifier "__hadd2" is undefined

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(1122): error: identifier "__hmul2" is undefined

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(1116): warning: variable "d" was declared but never referenced

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(5218): error: identifier "__hmul2" is undefined

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(5232): warning: no operator "+=" matches these operands

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(5237): error: more than one conversion function from "__half" to a built-in type applies:
            function "__half::operator float() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(204): here
            function "__half::operator short() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(222): here
            function "__half::operator unsigned short() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(225): here
            function "__half::operator int() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(228): here
            function "__half::operator unsigned int() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(231): here
            function "__half::operator long long() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(234): here
            function "__half::operator unsigned long long() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(237): here
            function "__half::operator __nv_bool() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(241): here

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(5237): error: more than one conversion function from "__half" to a built-in type applies:
            function "__half::operator float() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(204): here
            function "__half::operator short() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(222): here
            function "__half::operator unsigned short() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(225): here
            function "__half::operator int() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(228): here
            function "__half::operator unsigned int() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(231): here
            function "__half::operator long long() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(234): here
            function "__half::operator unsigned long long() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(237): here
            function "__half::operator __nv_bool() const"
/usr/local/cuda/targets/aarch64-linux/include/cuda_fp16.hpp(241): here

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(5232): error: no operator "+=" matches these operands
            operand types are: half2 += __half2
          detected during instantiation of "void dequantize_mul_mat_vec<qk,qr,dequantize_kernel>(const void *, const dfloat *, float *, int, int) [with qk=32, qr=2, dequantize_kernel=&dequantize_q4_0]"
(6467): here

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(5232): error: no operator "+=" matches these operands
            operand types are: half2 += __half2
          detected during instantiation of "void dequantize_mul_mat_vec<qk,qr,dequantize_kernel>(const void *, const dfloat *, float *, int, int) [with qk=32, qr=2, dequantize_kernel=&dequantize_q4_1]"
(6476): here

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(5232): error: no operator "+=" matches these operands
            operand types are: half2 += __half2
          detected during instantiation of "void dequantize_mul_mat_vec<qk,qr,dequantize_kernel>(const void *, const dfloat *, float *, int, int) [with qk=32, qr=2, dequantize_kernel=&dequantize_q5_0]"
(6485): here

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(5232): error: no operator "+=" matches these operands
            operand types are: half2 += __half2
          detected during instantiation of "void dequantize_mul_mat_vec<qk,qr,dequantize_kernel>(const void *, const dfloat *, float *, int, int) [with qk=32, qr=2, dequantize_kernel=&dequantize_q5_1]"
(6494): here

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(5232): error: no operator "+=" matches these operands
            operand types are: half2 += __half2
          detected during instantiation of "void dequantize_mul_mat_vec<qk,qr,dequantize_kernel>(const void *, const dfloat *, float *, int, int) [with qk=32, qr=1, dequantize_kernel=&dequantize_q8_0]"
(6503): here

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(5232): error: no operator "+=" matches these operands
            operand types are: half2 += __half2
          detected during instantiation of "void dequantize_mul_mat_vec<qk,qr,dequantize_kernel>(const void *, const dfloat *, float *, int, int) [with qk=1, qr=1, dequantize_kernel=&convert_f16]"
(6554): here

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(630): warning: function "warp_reduce_sum(half2)" was declared but never referenced

/home/dhq/Documents/ollama/llm/llama.cpp/ggml-cuda.cu(651): warning: function "warp_reduce_max(half2)" was declared but never referenced


@remy415 commented on GitHub (Feb 3, 2024):

No, I didn’t. Did you run go clean first? And did you pull from the repo linked in my pull request?

@remy415 commented on GitHub (Feb 3, 2024):

https://github.com/remy415/ollama.git

@remy415 commented on GitHub (Feb 3, 2024):

I need more info about your setup and what repo you’re using

@Q-point commented on GitHub (Feb 3, 2024):

@remy415 I'm using your repo; I simply followed your instructions at https://github.com/remy415/ollama_tegra_fix

  • Jetson AGX Orin 64GB
  • go version go1.21.6 linux/arm64
  • cmake version 3.28.2

Regarding the comment in bold ('IMPORTANT: also ensure -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc is in the llm/generate/gen_linux.sh file under CUBLAS'):

That is already included under:

        TEGRA_COMPILER="-DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc"
        TEGRA_CUDA_DEFS="-DCMAKE_CUDA_STANDARD=17 -DLLAMA_CUBLAS=on -DLLAMA_CUDA_FORCE_MMQ=on  -DLLAMA_CUDA_F16=1"
        
@remy415 commented on GitHub (Feb 3, 2024):

Yes, I created the guide before I made the package; I need to update the guide. If you're running Jetpack 5, it should work if you clone the repo and install as-is. I am currently reworking my edits to align with the default install. I think if you ensure your LD_LIBRARY_PATH is set, it should work.

@remy415 commented on GitHub (Feb 3, 2024):

@Q-point use this repo in an empty folder, it’s the whole package https://github.com/remy415/ollama.git

@Q-point commented on GitHub (Feb 3, 2024):

OK, just got it working with that repo. https://github.com/remy415/ollama.git

@remy415 commented on GitHub (Feb 3, 2024):

Okay, let me know if the GPU acceleration works 😃

@Q-point commented on GitHub (Feb 3, 2024):

(image: https://github.com/ollama/ollama/assets/5604553/2475fc4f-9164-436c-ac84-58c07783bcd5)

@remy415 commented on GitHub (Feb 3, 2024):

Awesome! Now I kinda wish I got an AGX instead of 4 Orin Nanos 😂🤣

@dhiltgen commented on GitHub (Feb 5, 2024):

> @dhiltgen I don't know how you want to approach incorporating this into Ollama for Jetson users, whether you want to incorporate it into the main branch or offer it as a patch or something.

@remy415 could you open a draft PR with your WIP so we can take a look and provide feedback?

Scratch that, you already did - just missed it.

@remy415 commented on GitHub (Feb 5, 2024):

> > @dhiltgen I don't know how you want to approach incorporating this into Ollama for Jetson users, whether you want to incorporate it into the main branch or offer it as a patch or something.
>
> @remy415 could you open a draft PR with your WIP so we can take a look and provide feedback?

@dhiltgen
https://github.com/ollama/ollama/pull/2279

@jhkuperus commented on GitHub (Feb 15, 2024):

I was reading along through this thread yesterday when I received my Jetson AGX Devkit. I couldn't get Ollama to use the CUDA cores, and then subsequently bricked my Jetson while upgrading things and trying to compile the version from #2279.

Today I reflashed and upgraded everything on the Jetson, and managed to get the version from @remy415's PR compiling and working when I start it as root. However, when I try to start it as a system service, it fails with 'permission denied' when trying to load the CUDA libraries. Here's the relevant output:

Feb 15 12:26:22 ubuntu ollama[5728]: time=2024-02-15T12:26:22.748Z level=INFO source=gpu.go:133 msg="Detecting GPU type"
Feb 15 12:26:22 ubuntu ollama[5728]: time=2024-02-15T12:26:22.748Z level=INFO source=gpu.go:317 msg="Searching for GPU management library libcudart.so"
Feb 15 12:26:22 ubuntu ollama[5728]: time=2024-02-15T12:26:22.752Z level=INFO source=gpu.go:363 msg="Discovered GPU libraries: [/usr/local/cuda/lib64/libcudart.so.12.2.140 /usr/local/cuda-12/lib64/libcudart.so.12.2.140 /usr/local/cuda-12.2/lib64/libcudart.so.12.2.140]"
Feb 15 12:26:22 ubuntu ollama[5728]: NvRmMemInitNvmap failed with Permission denied
Feb 15 12:26:22 ubuntu ollama[5728]: 356: Memory Manager Not supported
Feb 15 12:26:22 ubuntu ollama[5728]: ****NvRmMemMgrInit failed**** error type: 196626
Feb 15 12:26:22 ubuntu ollama[5728]: time=2024-02-15T12:26:22.756Z level=INFO source=gpu.go:375 msg="Unable to load CUDA management library /usr/local/cuda/lib64/libcudart.so.12.2.140: cudart vram init failure: 999"
Feb 15 12:26:22 ubuntu ollama[5728]: NvRmMemInitNvmap failed with Permission denied
Feb 15 12:26:22 ubuntu ollama[5728]: 356: Memory Manager Not supported
Feb 15 12:26:22 ubuntu ollama[5728]: ****NvRmMemMgrInit failed**** error type: 196626
Feb 15 12:26:22 ubuntu ollama[5728]: time=2024-02-15T12:26:22.758Z level=INFO source=gpu.go:375 msg="Unable to load CUDA management library /usr/local/cuda-12/lib64/libcudart.so.12.2.140: cudart vram init failure: 999"
Feb 15 12:26:22 ubuntu ollama[5728]: NvRmMemInitNvmap failed with Permission denied
Feb 15 12:26:22 ubuntu ollama[5728]: 356: Memory Manager Not supported
Feb 15 12:26:22 ubuntu ollama[5728]: ****NvRmMemMgrInit failed**** error type: 196626
Feb 15 12:26:22 ubuntu ollama[5728]: time=2024-02-15T12:26:22.759Z level=INFO source=gpu.go:375 msg="Unable to load CUDA management library /usr/local/cuda-12.2/lib64/libcudart.so.12.2.140: cudart vram init failure: 999"
Feb 15 12:26:22 ubuntu ollama[5728]: time=2024-02-15T12:26:22.759Z level=INFO source=gpu.go:317 msg="Searching for GPU management library librocm_smi64.so"
Feb 15 12:26:22 ubuntu ollama[5728]: time=2024-02-15T12:26:22.760Z level=INFO source=gpu.go:363 msg="Discovered GPU libraries: []"
Feb 15 12:26:22 ubuntu ollama[5728]: time=2024-02-15T12:26:22.760Z level=INFO source=cpu_common.go:18 msg="CPU does not have vector extensions"
Feb 15 12:26:22 ubuntu ollama[5728]: time=2024-02-15T12:26:22.760Z level=INFO source=routes.go:1036 msg="no GPU detected"

What am I missing here? The files in /usr/local/cuda/lib64 look like they have permissions that are set just fine:

lrwxrwxrwx 1 root root        15 Aug 16  2023 libcudart.so -> libcudart.so.12
lrwxrwxrwx 1 root root        21 Aug 16  2023 libcudart.so.12 -> libcudart.so.12.2.140
-rw-r--r-- 1 root root    747424 Aug 16  2023 libcudart.so.12.2.140
-rw-r--r-- 1 root root   1403952 Aug 16  2023 libcudart_static.a
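The NvRmMemInitNvmap failure points at the Tegra device nodes rather than the library files themselves. A quick diagnostic sketch, assuming the usual Jetson device-node names (they vary a bit between L4T releases, hence the 2>/dev/null):

```
# Library file permissions look fine; check the Tegra device nodes instead.
# On most Jetson images these are owned by root:video.
ls -l /dev/nvmap /dev/nvhost-ctrl /dev/nvhost-ctrl-gpu 2>/dev/null
# And check which groups the service user actually has.
id ollama
```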

@remy415 commented on GitHub (Feb 15, 2024):

~~@jhkuperus which Jetson AGX do you have? What version of L4T/Jetpack? I see you're running CUDA 12, did you install that manually? Jetsons prior to Jetpack 6 come with CUDA 11.4, and only the Orin series even supports Jetpack 6. Also, JP6 is in beta; if you're using it, I recommend reflashing JP5.~~

Disregard, see below.

Note that if you have continuing issues with JP6, I would recommend reflashing JP5, since JP6 is still a "developer preview" and isn't a full release yet.


@remy415 commented on GitHub (Feb 15, 2024):

@jhkuperus

> ... working when I start it as root. However, when I try to start it as a system service, it fails with a permission denied when trying to load the CUDA libraries.

Oh, I missed this part. I also poked more into the errors being displayed, and two suggested fixes pop up frequently:

  1. Reboot the system with sudo init 6 or whichever mechanism you prefer.
  2. Reload the nvidia kernel modules. On my Jetson Orin Nano, the only module present is 'nvidia'. You can check this with sudo lsmod | grep -i nvidia on your system. Note that the -r in sudo modprobe -r has the same effect as the rmmod command.

sudo modprobe -r nvidia
sudo modprobe nvidia

Other systems referenced nvidia_uvm, I didn't have this module on my system but if it's present on yours you can use the same commands:

sudo modprobe -r nvidia_uvm
sudo modprobe nvidia_uvm

Let me know if any of this works for you.


@jhkuperus commented on GitHub (Feb 15, 2024):

These are the details of my Jetson device:

Device used to test:
Jetson AGX Orin Developer Kit 64GB
Jetpack 6.0DP, L4T 36.2.0
CUDA 12.2.140
CUDA Capability Supported 8.7
Go version 1.21.6
Cmake 3.22.1
nvcc 12.2.140

I received it yesterday and I couldn't get compilation to work on the pre-flashed JP5. So I tried to upgrade the system, but that broke more things than it fixed. Today I reflashed the device with JP6 DP. Compilation went smoothly, and it runs perfectly when I run it as root or as my normal user. If I try to start it as the ollama user, I get permission denied.

This is the output of my lsmod for the nvidia modules; I don't see a problem here:

nvidia_modeset       1253376  3
nvidia               1454080  7 nvidia_modeset
nvidia_vrs_pseq        16384  0
tegra_dce              98304  2 nvidia
tsecriscv              28672  1 nvidia
host1x_nvhost          40960  10 nvhost_isp5,nvhost_nvcsi_t194,nvidia,tegra_camera,nvhost_nvdla,nvhost_capture,nvhost_nvcsi,nvhost_pva,nvhost_vi5,nvidia_modeset
mc_utils               16384  3 nvidia,nvgpu,tegra_camera_platform
host1x_next           180224  8 tegra_drm_next,host1x_nvhost,host1x_fence,tegra_se,nvgpu,nvhost_nvdla,nvhost_pva,nvidia_modeset
drm                   602112  12 tegra_drm_next,drm_kms_helper,nvidia

I'm thinking maybe it's a permission problem on a device or socket, or something like that?


@remy415 commented on GitHub (Feb 15, 2024):

I didn't realize you were running it as a separate user. I'm assuming you added the user manually? May I see the contents of your service file please?


@jhkuperus commented on GitHub (Feb 15, 2024):

Hah! Just figured it out. The problem is actually quite simple: the ollama user had to be a member of the video group to fix it.

This user is added by the installation script provided by the Ollama repository itself. Maybe this is something we can still add to your PR?
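For anyone landing here with the same symptom, a minimal sketch of the fix, assuming the service runs as the ollama user created by the install script:

```
# Add the ollama service user to the video group, then restart the service
# so the new group membership takes effect.
sudo usermod -a -G video ollama
sudo systemctl restart ollama
```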


@remy415 commented on GitHub (Feb 15, 2024):

The official script adds the ollama user to the render group. If/when the PR is merged into the main branch, the rest should work automatically, as the PR only affects which shared libraries are loaded when the binary is executed.
From the source code (https://github.com/ollama/ollama/blob/main/scripts/install.sh):

configure_systemd() {
    if ! id ollama >/dev/null 2>&1; then
        status "Creating ollama user..."
        $SUDO useradd -r -s /bin/false -m -d /usr/share/ollama ollama
    fi
    if getent group render >/dev/null 2>&1; then
        status "Adding ollama user to render group..."
        $SUDO usermod -a -G render ollama
    fi

    status "Adding current user to ollama group..."
    $SUDO usermod -a -G ollama $(whoami)

    status "Creating ollama systemd service..."
    cat <<EOF | $SUDO tee /etc/systemd/system/ollama.service >/dev/null
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=$BINDIR/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=$PATH"

[Install]
WantedBy=default.target
EOF
    SYSTEMCTL_RUNNING="$(systemctl is-system-running || true)"
    case $SYSTEMCTL_RUNNING in
        running|degraded)
            status "Enabling and starting ollama service..."
            $SUDO systemctl daemon-reload
            $SUDO systemctl enable ollama

            start_service() { $SUDO systemctl restart ollama; }
            trap start_service EXIT
            ;;
    esac
}
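A hypothetical addition mirroring the script's render-group block (not in the upstream script; the video group requirement is specific to the Jetson device nodes), in case it helps the PR:

```
# Hypothetical: also add the ollama user to the video group, which the
# Tegra device nodes require on Jetson.
if getent group video >/dev/null 2>&1; then
    status "Adding ollama user to video group..."
    $SUDO usermod -a -G video ollama
fi
```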

@davidtheITguy commented on GitHub (Feb 18, 2024):

@Q-point Hi, could you please let me know what you did to resolve these build errors? I'm where you were: https://github.com/ollama/ollama/issues/1979#issuecomment-1925028134 Many thanks.


@Q-point commented on GitHub (Feb 18, 2024):

@davidtheITguy I used the suggested repo https://github.com/remy415/ollama.git and compiled from source. You'll have to install the Go compiler.


@davidtheITguy commented on GitHub (Feb 18, 2024):

Yes sir, did all that. Also overlaid the [package_cudart_build] directory per https://github.com/remy415/ollama_tegra_fix, and loaded the suggested env vars too. Compiled from source and got the same errors you posted: https://github.com/ollama/ollama/issues/1979#issuecomment-1925028134.

Appreciate the response. I'll keep plugging away; good to know I'm on the right track.


@remy415 commented on GitHub (Feb 18, 2024):

> Yes sir, did all that. Also overlaid the [package_cudart_build] directory per https://github.com/remy415/ollama_tegra_fix, and loaded the suggested env vars too. Compiled from source and got the same errors you posted: #1979 (comment).
>
> Appreciate the response. I'll keep plugging away; good to know I'm on the right track.

@davidtheITguy what error messages are you getting? Are they the ones about half precision posted above? When did you last pull the repo? It should work with any JP5 Jetson.


@davidtheITguy commented on GitHub (Feb 18, 2024):

Apologies in advance for posting the entire error output, but I want you to see it. The build actually got pretty far before this happened:

/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(645): warning: function "warp_reduce_sum(half2)" was declared but never referenced
/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(666): warning: function "warp_reduce_max(half2)" was declared but never referenced
/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(1071): error: identifier "__hsub2" is undefined
/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(1072): error: identifier "__hmul2" is undefined
/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(1063): warning: variable "d" was declared but never referenced
/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(1091): error: identifier "__hmul2" is undefined
/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(1092): error: identifier "__hadd2" is undefined
/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(1114): error: identifier "__hsub2" is undefined
/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(1115): error: identifier "__hmul2" is undefined
/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(1102): warning: variable "d" was declared but never referenced
/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(1138): error: identifier "__hmul2" is undefined
/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(1139): error: identifier "__hadd2" is undefined
/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(1155): error: identifier "__hmul2" is undefined
/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(1149): warning: variable "d" was declared but never referenced
/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(5433): error: identifier "__hmul2" is undefined
/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(5447): warning: no operator "+=" matches these operands
/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(5452): error: more than one conversion function from "__half" to a built-in type applies:
            function "__half::operator float() const"
/usr/local/cuda/include/cuda_fp16.hpp(204): here
            function "__half::operator short() const"
/usr/local/cuda/include/cuda_fp16.hpp(222): here
            function "__half::operator unsigned short() const"
/usr/local/cuda/include/cuda_fp16.hpp(225): here
            function "__half::operator int() const"
/usr/local/cuda/include/cuda_fp16.hpp(228): here
            function "__half::operator unsigned int() const"
/usr/local/cuda/include/cuda_fp16.hpp(231): here
            function "__half::operator long long() const"
/usr/local/cuda/include/cuda_fp16.hpp(234): here
            function "__half::operator unsigned long long() const"
/usr/local/cuda/include/cuda_fp16.hpp(237): here
            function "__half::operator __nv_bool() const"
/usr/local/cuda/include/cuda_fp16.hpp(241): here
/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(5452): error: more than one conversion function from "__half" to a built-in type applies:
            function "__half::operator float() const"
/usr/local/cuda/include/cuda_fp16.hpp(204): here
            function "__half::operator short() const"
/usr/local/cuda/include/cuda_fp16.hpp(222): here
            function "__half::operator unsigned short() const"
/usr/local/cuda/include/cuda_fp16.hpp(225): here
            function "__half::operator int() const"
/usr/local/cuda/include/cuda_fp16.hpp(228): here
            function "__half::operator unsigned int() const"
/usr/local/cuda/include/cuda_fp16.hpp(231): here
            function "__half::operator long long() const"
/usr/local/cuda/include/cuda_fp16.hpp(234): here
            function "__half::operator unsigned long long() const"
/usr/local/cuda/include/cuda_fp16.hpp(237): here
            function "__half::operator __nv_bool() const"
/usr/local/cuda/include/cuda_fp16.hpp(241): here
/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(5447): error: no operator "+=" matches these operands
            operand types are: half2 += __half2
          detected during instantiation of "void dequantize_mul_mat_vec<qk,qr,dequantize_kernel>(const void *, const dfloat *, float *, int, int) [with qk=32, qr=2, dequantize_kernel=&dequantize_q4_0]"
(6768): here
/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(5447): error: no operator "+=" matches these operands
            operand types are: half2 += __half2
          detected during instantiation of "void dequantize_mul_mat_vec<qk,qr,dequantize_kernel>(const void *, const dfloat *, float *, int, int) [with qk=32, qr=2, dequantize_kernel=&dequantize_q4_1]"
(6777): here

/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(5447): error: no operator "+=" matches these operands
            operand types are: half2 += __half2
          detected during instantiation of "void dequantize_mul_mat_vec<qk,qr,dequantize_kernel>(const void *, const dfloat *, float *, int, int) [with qk=32, qr=2, dequantize_kernel=&dequantize_q5_0]"
(6786): here

/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(5447): error: no operator "+=" matches these operands
            operand types are: half2 += __half2
          detected during instantiation of "void dequantize_mul_mat_vec<qk,qr,dequantize_kernel>(const void *, const dfloat *, float *, int, int) [with qk=32, qr=2, dequantize_kernel=&dequantize_q5_1]"
(6795): here

/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(5447): error: no operator "+=" matches these operands
            operand types are: half2 += __half2
          detected during instantiation of "void dequantize_mul_mat_vec<qk,qr,dequantize_kernel>(const void *, const dfloat *, float *, int, int) [with qk=32, qr=1, dequantize_kernel=&dequantize_q8_0]"
(6804): here

/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(5447): error: no operator "+=" matches these operands
            operand types are: half2 += __half2
          detected during instantiation of "void dequantize_mul_mat_vec<qk,qr,dequantize_kernel>(const void *, const dfloat *, float *, int, int) [with qk=1, qr=1, dequantize_kernel=&convert_f16]"
(6855): here

/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(645): warning: function "warp_reduce_sum(half2)" was declared but never referenced

/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(666): warning: function "warp_reduce_max(half2)" was declared but never referenced

/ssd/ollama/ollama/llm/llama.cpp/ggml-cuda.cu(1696): warning: variable "ksigns64" was declared but never referenced

I tried to follow your directions to the letter: integrated the ollama_tegra_fix repo, cloned your latest fork, etc.

Tool chain versions:

cmake version 3.20.2
go version go1.13.8 linux/arm64
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0

@remy415 commented on GitHub (Feb 18, 2024):

@davidtheITguy Okay, that error you're seeing happens for me when I try to compile with the CUDA architectures set incorrectly. Are you setting the architectures manually? Ensure you clear the variable CUDA_ARCHITECTURES with the command export CUDA_ARCHITECTURES="" and let the script configure it automatically. I've incorporated the necessary fixes into the code base and eliminated the need for most of the ollama_tegra_fix page. The only env variable that is absolutely critical is:

export LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/cuda/compat:/usr/local/cuda/include"

and the one regarding CPU generation is just helpful to Jetson users, as it skips the AVX builds:

export OLLAMA_SKIP_CPU_GENERATE="yes"

I would highly recommend deleting your ollama folder completely, re-cloning https://github.com/remy415/ollama.git, ensuring only these 2 env variables are set, and running go generate ./... && go build .
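Putting those steps together, a consolidated sketch (same commands and env vars as above, nothing new assumed):

```
# Clean rebuild on a Jetson, letting the build detect the CUDA architectures.
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/cuda/compat:/usr/local/cuda/include"
export OLLAMA_SKIP_CPU_GENERATE="yes"
export CUDA_ARCHITECTURES=""   # clear it so the build configures it automatically

rm -rf ollama                  # delete the old working copy completely
git clone --depth=1 --recursive https://github.com/remy415/ollama.git
cd ollama
go generate ./... && go build .
```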

For reference, during the build process you should see a reference to CUDA_ARCHITECTURES. Which architectures get loaded depends on your Jetson device and Jetpack version, but the build should detect them automatically if you are running any of the 3 Jetpack versions listed here:

Compute capability by device: Nano/TX1 = 5.3, TX2 = 6.2, Xavier = 7.2, Orin = 8.7

L4T_VERSION.major >= 36 (JetPack 6): CUDA_ARCHITECTURES = [87]
L4T_VERSION.major >= 34 (JetPack 5): CUDA_ARCHITECTURES = [72, 87]
L4T_VERSION.major == 32 (JetPack 4): CUDA_ARCHITECTURES = [53, 62, 72]

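If you want to sanity-check what the build should pick, you can read the L4T major version yourself. A sketch assuming the standard /etc/nv_tegra_release format (first line like "# R35 (release), REVISION: 4.1, ..."):

```
# Map the L4T major version to the expected CUDA_ARCHITECTURES list.
L4T_MAJOR=$(sed -n 's/^# R\([0-9]*\).*/\1/p' /etc/nv_tegra_release)
if   [ "$L4T_MAJOR" -ge 36 ]; then echo "JetPack 6: CUDA_ARCHITECTURES = [87]"
elif [ "$L4T_MAJOR" -ge 34 ]; then echo "JetPack 5: CUDA_ARCHITECTURES = [72, 87]"
elif [ "$L4T_MAJOR" -eq 32 ]; then echo "JetPack 4: CUDA_ARCHITECTURES = [53, 62, 72]"
fi
```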


@remy415 commented on GitHub (Feb 18, 2024):

> Also overlaid the [package_cudart_build] directory per https://github.com/remy415/ollama_tegra_fix,

I just noticed this. Sorry, I haven't cleaned up the documentation yet as I'm still working on the code itself to get it cleaned up. This step is no longer needed as that package is no longer valid.


@davidtheITguy commented on GitHub (Feb 19, 2024):

Understood, and thank you. I'll circle back when the build is successful.


@davidtheITguy commented on GitHub (Feb 19, 2024):

Looks like the build is very close, but not quite there yet. go generate ./... seems to have completed successfully.

The go build . errors out with what appears to be an external reference issue: build github.com/jmorganca/ollama: cannot load crypto/ecdh: malformed module path "crypto/ecdh": missing dot in first path element.

My environment:
LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/compat:/usr/local/cuda/include
OLLAMA_SKIP_CPU_GENERATE=yes

Build steps:

  1. conda activate ollama # activate private build environment
  2. git clone --depth=1 --recursive https://github.com/remy415/ollama
  3. go generate ./...
  4. go build .

Debug steps: I ran a go get -u and then go mod tidy to try to resolve any reference issues; here is the output:

github.com/jmorganca/ollama/app/assets imports
        embed: malformed module path "embed": missing dot in first path element
github.com/jmorganca/ollama/app/assets imports
        io/fs: malformed module path "io/fs": missing dot in first path element
github.com/jmorganca/ollama/app/lifecycle imports
        log/slog: malformed module path "log/slog": missing dot in first path element
github.com/jmorganca/ollama/parser imports
        slices: malformed module path "slices": missing dot in first path element
github.com/jmorganca/ollama/auth imports
        golang.org/x/crypto/ssh imports
        golang.org/x/crypto/curve25519 imports
        crypto/ecdh: malformed module path "crypto/ecdh": missing dot in first path element

I couldn't find any local reference to "crypto/ecdh", so I'm wondering if it is an external dependency...
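These "malformed module path" errors on standard-library packages like embed and crypto/ecdh typically point at a Go toolchain older than the module requires (crypto/ecdh only entered the standard library in Go 1.20). A quick check, assuming you're in the repo root:

```
# Compare the installed toolchain with the module's requirement.
go version          # e.g. "go version go1.13.8 linux/arm64"
grep '^go ' go.mod  # the minimum Go version the module expects
```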


@remy415 commented on GitHub (Feb 19, 2024):

I found the below env var in a forum post about some changes in Go 1.13. Odd that you're getting that error; I haven't seen it before. I think it may have something to do with your conda environment. Try checking go env before and after activating your conda environment.

Try this:
export GO111MODULE=off


@davidtheITguy commented on GitHub (Feb 20, 2024):

Ok, got it to compile. Made a rookie move: my version of Go was something like 1.13, while go.mod says 1.21 (but doesn't enforce it for some reason). Upgrading Go to 1.21.7 worked for me.

So for the next person, just to sum up what's needed as of this comment to compile the fork for Nvidia GPUs:

  1. Set up your environment:
     export LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/cuda/compat:/usr/local/cuda/include"
     export OLLAMA_SKIP_CPU_GENERATE="yes"
  2. Clone the forked repo from @remy415: git clone --depth=1 --recursive https://github.com/remy415/ollama
  3. cd ollama
  4. go generate ./...
  5. go build .
  6. Do a manual ollama install per https://github.com/ollama/ollama/blob/main/docs/linux.md#manual-install (you need to COPY the new executable to /usr/bin/ollama, don't 'curl' it; see the sketch below)
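A minimal sketch of step 6, assuming the binary was built in the current directory and the systemd service from the install script is already in place:

```
# Copy the freshly built binary over the installed one, then restart the service.
sudo cp ./ollama /usr/bin/ollama
sudo systemctl restart ollama
```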

BUT still no CUDA/GPU unfortunately; it looks like a permissions issue on the CUDA libs:

Feb 19 19:40:32 nvidia-agx ollama[178771]: time=2024-02-19T19:40:32.613-05:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
Feb 19 19:40:40 nvidia-agx ollama[178771]: time=2024-02-19T19:40:40.217-05:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cuda_v11]"
Feb 19 19:40:40 nvidia-agx ollama[178771]: time=2024-02-19T19:40:40.217-05:00 level=INFO source=gpu.go:133 msg="Detecting GPU type"
Feb 19 19:40:40 nvidia-agx ollama[178771]: time=2024-02-19T19:40:40.217-05:00 level=INFO source=gpu.go:320 msg="Searching for GPU management library libcudart.so"
Feb 19 19:40:40 nvidia-agx ollama[178771]: time=2024-02-19T19:40:40.221-05:00 level=INFO source=gpu.go:366 msg="Discovered GPU libraries: [/usr/local/cuda/lib64/libcudart.so.11.4.298 /usr/local/c>
Feb 19 19:40:40 nvidia-agx ollama[178771]: NvRmMemInitNvmap failed with Permission denied
Feb 19 19:40:40 nvidia-agx ollama[178771]: 549: Memory Manager Not supported
Feb 19 19:40:40 nvidia-agx ollama[178771]: ****NvRmMemInit failed**** error type: 196626
Feb 19 19:40:40 nvidia-agx ollama[178771]: *** NvRmMemInit failed NvRmMemConstructor
Feb 19 19:40:40 nvidia-agx ollama[178771]: time=2024-02-19T19:40:40.228-05:00 level=INFO source=gpu.go:378 msg="Unable to load CUDA management library /usr/local/cuda/lib64/libcudart.so.11.4.298:>
Feb 19 19:40:40 nvidia-agx ollama[178771]: time=2024-02-19T19:40:40.230-05:00 level=INFO source=gpu.go:378 msg="Unable to load CUDA management library /usr/local/cuda-11/lib64/libcudart.so.11.4.2>
Feb 19 19:40:40 nvidia-agx ollama[178771]: time=2024-02-19T19:40:40.232-05:00 level=INFO source=gpu.go:378 msg="Unable to load CUDA management library /usr/local/cuda-11.4/lib64/libcudart.so.11.4>
Feb 19 19:40:40 nvidia-agx ollama[178771]: time=2024-02-19T19:40:40.232-05:00 level=INFO source=gpu.go:320 msg="Searching for GPU management library librocm_smi64.so"
Feb 19 19:40:40 nvidia-agx ollama[178771]: time=2024-02-19T19:40:40.232-05:00 level=INFO source=gpu.go:366 msg="Discovered GPU libraries: []"
Feb 19 19:40:40 nvidia-agx ollama[178771]: time=2024-02-19T19:40:40.232-05:00 level=INFO source=cpu_common.go:18 msg="CPU does not have vector extensions"
Feb 19 19:40:40 nvidia-agx ollama[178771]: time=2024-02-19T19:40:40.232-05:00 level=INFO source=routes.go:1037 msg="no GPU detected"

I'll keep plugging away. Hopefully this will all get settled and merged at some point.

Many thanks @remy415 for your assistance on the build.


@remy415 commented on GitHub (Feb 20, 2024):

@davidtheITguy the user running the binary needs to be part of both the render and video user groups.
The command follows this syntax: sudo usermod -a -G examplegroup exampleusername.

sudo usermod -a -G render <user name>
sudo usermod -a -G video <user name>

If you're running it as a service using the ollama user, then add the ollama user to those groups. If you're running directly from the CLI, ensure your own user is part of the groups.
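A sketch covering both cases, assuming the service user is named ollama as created by the install script:

```
# Service case: grant the ollama service user both groups, then restart.
sudo usermod -a -G render,video ollama
sudo systemctl restart ollama

# CLI case: grant your own user both groups.
sudo usermod -a -G render,video "$(whoami)"
# Log out and back in (or use newgrp) so the new memberships take effect.
```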


@davidtheITguy commented on GitHub (Feb 20, 2024):

@remy415 Bingo! That did it, thanks for the help!


@davidtheITguy commented on GitHub (Feb 20, 2024):

@remy415 running llama2-7b from the CLI is blazing fast on my Jetson Orin AGX 64GB. Here's hoping this gets merged soon, and again, many thanks for your assistance!


@remy415 commented on GitHub (Feb 20, 2024):

@davidtheITguy glad I could help! I’m working on fixing some quirks in the differences between the two CUDA libraries, hope to be done soon.


@telemetrieTP23 commented on GitHub (Feb 21, 2024):

Thanks for your help ( https://github.com/ollama/ollama/issues/2491#issuecomment-1954615027 ).
There, by "preliminary build", did you mean that I should build your fork?

I have cmake version 3.28.2
go version go1.27.1 linux/arm64
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0

tried your tips from there:
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/cuda/compat:/usr/local/cuda/include"
export OLLAMA_SKIP_CPU_GENERATE="1"
export CUDA_ARCHITECTURES="72;87"

But still, when I start go generate ./..., it generates '-DCMAKE_CUDA_ARCHITECTURES=50;52;61;70;75;80' in the CMAKE_DEFS,

and errors like the ones here were produced: https://github.com/ollama/ollama/issues/1979#issuecomment-1951463913


@remy415 commented on GitHub (Feb 21, 2024):

@telemetrieTP23 I responded to your other post 😀

I had a typo in my original response and it’s updated now.


@davidtheITguy commented on GitHub (Feb 22, 2024):

@remy415 Hey, I had a quick follow-up question to all this. Your build is working great on my Orin AGX!
However, after a lot of testing, it appears that (believe it or not) we are offloading to only one GPU (there are two separate devices on the Orin). I've confirmed this with the standard Python HF libs, which do use both GPUs during inference.

Here is some output from the ollama service:

Feb 22 10:52:35 nvidia-agx ollama[1181]: ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
Feb 22 10:52:35 nvidia-agx ollama[1181]: ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
Feb 22 10:52:35 nvidia-agx ollama[1181]: ggml_init_cublas: found 1 CUDA devices:
Feb 22 10:52:35 nvidia-agx ollama[1181]:   Device 0: Orin, compute capability 8.7, VMM: yes

(screenshot: JTOP GPU usage, https://github.com/ollama/ollama/assets/74510247/e2fe6b49-b1e6-4e50-b498-26ed0e9e6211)

Given beggars can't be choosers, have you run across this issue? Is there a switch perhaps for CUDA=all or similar?

Thanks!


@remy415 commented on GitHub (Feb 22, 2024):

@davidtheITguy According to the NVIDIA Jetson AGX Orin technical brief at https://www.nvidia.com/content/dam/en-zz/Solutions/gtcf21/jetson-orin/nvidia-jetson-agx-orin-technical-brief.pdf, it would seem that it's two graphics compute clusters with a unified front end.

(image: excerpt from the technical brief, https://github.com/ollama/ollama/assets/105550370/45bb60ec-fd2f-48f2-8744-292845278fc6)

I don't have the AGX Orin so I can't personally confirm, but one way I would check is to run a device query that shows the getDeviceCount results (the ollama service logs show the results of that API call further up towards the top; export OLLAMA_DEBUG="1" if you don't see it). Another way: run the python script you referenced alongside JTOP, and compare the reported GPU usage with what you see when running ollama.
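One such device query, if the CUDA samples are installed (the path below is the usual JetPack layout and may differ on your image or CUDA version):

```
# Build and run the CUDA deviceQuery sample; it prints the device count
# and per-device properties as seen by the CUDA runtime.
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery
```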

Last, the lines beginning with ggml_init_cublas are actually logs reported from the llama.cpp back end. The ollama code we have is purely for querying device information and verifying the device is present & ready to go. Most (if not all) of the performance-related code is in the llama.cpp back end, but based on what your screenshots are showing I believe your AGX Orin is firing on all cylinders.


@davidtheITguy commented on GitHub (Feb 22, 2024):

@remy415 Got it, ty. Weird that only one GPU works (I definitely get both with the python/HF scripts). I'll report back when able.


@davidtheITguy commented on GitHub (Feb 22, 2024):

@remy415 A quick clarification: you are correct, it does appear that both GPUs are exposed as one interface. The difference appears to be the "GPU Shared RAM" capability not being utilized by the llama.cpp back end in this case. I'll keep digging.


@remy415 commented on GitHub (Feb 22, 2024):

@davidtheITguy I don't know which model you have loaded, but your JTOP is reporting 9.1G of GPU shared RAM in use, which is definitely more than a model like Mistral 7B typically uses (~4G of RAM). I think your hardware is being fully leveraged.


@ToeiRei commented on GitHub (Mar 10, 2024):

I cannot reproduce your builds using go version go1.22.1 linux/arm64. Does anyone have a binary for an Orin NX 16 GB?


@remy415 commented on GitHub (Mar 10, 2024):

> I cannot reproduce your builds using go version go1.22.1 linux/arm64. Does anyone have a binary for an Orin NX 16 GB?

It should build on the Orin NX. Please clone my repo at https://github.com/remy415/ollama. Then:

cd ollama
go generate ./... && go build .

When the build is done, either put it in a location on your PATH or use ./ollama serve and ./ollama run <model>. If you have any issues, please include logs so we can troubleshoot.
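If it misbehaves, a sketch for capturing useful logs (OLLAMA_DEBUG is the same flag mentioned earlier in the thread):

```
# Run the freshly built binary with verbose logging and save the output.
export OLLAMA_DEBUG="1"
./ollama serve 2>&1 | tee ollama.log
```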


@ToeiRei commented on GitHub (Mar 10, 2024):

Thanks @remy415, already trying with your repo. The error on go build is:

vbauer@jetson:~/ollama$ go build .
# github.com/jmorganca/ollama/gpu
In file included from gpu_info_nvml.h:4,
                 from gpu_info_nvml.c:5:
gpu_info_nvml.c: In function ‘nvml_check_vram’:
gpu_info_nvml.c:158:20: warning: format ‘%ld’ expects argument of type ‘long int’, but argument 4 has type ‘long long unsigned int’ [-Wformat=]
  158 |     LOG(h.verbose, "[%d] CUDA totalMem %ld\n", i, memInfo.total);
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~     ~~~~~~~~~~~~~
      |                                                          |
      |                                                          long long unsigned int
gpu_info.h:33:23: note: in definition of macro ‘LOG’
   33 |       fprintf(stderr, __VA_ARGS__); \
      |                       ^~~~~~~~~~~
gpu_info_nvml.c:158:42: note: format string is defined here
  158 |     LOG(h.verbose, "[%d] CUDA totalMem %ld\n", i, memInfo.total);
      |                                        ~~^
      |                                          |
      |                                          long int
      |                                        %lld
In file included from gpu_info_nvml.h:4,
                 from gpu_info_nvml.c:5:
gpu_info_nvml.c:159:20: warning: format ‘%ld’ expects argument of type ‘long int’, but argument 4 has type ‘long long unsigned int’ [-Wformat=]
  159 |     LOG(h.verbose, "[%d] CUDA freeMem %ld\n", i, memInfo.free);
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~     ~~~~~~~~~~~~
      |                                                         |
      |                                                         long long unsigned int
gpu_info.h:33:23: note: in definition of macro ‘LOG’
   33 |       fprintf(stderr, __VA_ARGS__); \
      |                       ^~~~~~~~~~~
gpu_info_nvml.c:159:41: note: format string is defined here
  159 |     LOG(h.verbose, "[%d] CUDA freeMem %ld\n", i, memInfo.free);
      |                                       ~~^
      |                                         |
      |                                         long int
      |                                       %lld
```

@remy415 commented on GitHub (Mar 10, 2024):

@ToeiRei Yes, those are compiler warnings, but they are not critical errors. Those particular warnings are present even in the Ollama main builds. The binary should still have compiled and should work for you.
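
For context on why those warnings are harmless here: NVML reports memory sizes as `unsigned long long`, while the log statement formats them with `%ld`; on 64-bit Linux both types are 8 bytes, so the value still prints correctly even though the types disagree. A minimal stand-alone illustration (not ollama's code):

```
#include <stdio.h>

int main(void) {
    /* NVML's nvmlMemory_t fields are unsigned long long. */
    unsigned long long total = 16ULL * 1024 * 1024 * 1024;

    /* Mismatched specifier: gcc emits a -Wformat warning, but on LP64
     * Linux 'long' and 'long long' are both 64-bit, so the right value
     * is printed anyway. */
    printf("CUDA totalMem %ld\n", total);

    /* The warning-free version: */
    printf("CUDA totalMem %llu\n", total);
    return 0;
}
```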


@ToeiRei commented on GitHub (Mar 10, 2024):

My bad. I had expected something like a "compile done" message, but it just showed the warnings and then a prompt.


@remy415 commented on GitHub (Mar 10, 2024):

No worries, I fell down the same rabbit hole myself.


@ToeiRei commented on GitHub (Mar 10, 2024):

Thanks. After failing to restart ollama due to a lack of caffeine, we're finally cooking with ~~gas... err~~ cuda.


@remy415 commented on GitHub (Mar 10, 2024):

That's great! How is the performance on the Orin NX?


@ToeiRei commented on GitHub (Mar 10, 2024):

It's definitely slower than my RTX 4070, but it doesn't feel too bad - like a person live-typing on a 13b model.

```
FROM vicuna:13b
PARAMETER num_gpu 999
```

But somehow there's got to be a problem with the model, as it says `Error: open /vicuna:13b: no such file or directory` on `ollama create` - running the model directly works fine.


@remy415 commented on GitHub (Mar 10, 2024):

While in the ollama directory, and with `ollama serve` running in the background, export the existing modelfile (use whatever name you want for `Modelfile`):

```
./ollama show vicuna:13b --modelfile > Modelfile
```

Edit the model file with `vim Modelfile` (or whichever text editor you want to use) and add `PARAMETER num_gpu 999` underneath the `FROM` line.

Create the model:

```
./ollama create <NEW MODEL NAME OF YOUR CHOICE> -f ./Modelfile
```

It would seem the model discovery logic is currently broken, as the Modelfile I had made previously also failed with the same error as yours; I had to use the above commands to generate a new "template" with the correct path auto-populated.
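
For illustration, the exported Modelfile then looks roughly like the sketch below. The `FROM` path is a placeholder: `ollama show --modelfile` fills in the real blob path on disk, which is why the auto-generated template works where a hand-written `FROM vicuna:13b` gets treated as a file path and fails:

```
# Hypothetical excerpt of the exported Modelfile; the path is illustrative
FROM /usr/share/ollama/.ollama/models/blobs/sha256-<digest>
PARAMETER num_gpu 999
```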


@UserName-wang commented on GitHub (Mar 23, 2024):

> @remy415 running llama2-7b via the CLI is blazing fast on my Jetson Orin AGX 64GB. Here's to hoping you merge this soon, and again many thanks for your assistance!

@davidtheITguy, did you run successfully on JETSON_JETPACK="36.2.0"?

I got some error messages:
```
=> ERROR [ollama-cuda-l4t-base 9/9] RUN go build . 20.3s

[ollama-cuda-l4t-base 9/9] RUN go build .:
11.85 # github.com/jmorganca/ollama/gpu
11.85 In file included from gpu_info_nvml.h:4,
11.85 from gpu_info_nvml.c:5:
11.85 gpu_info_nvml.c: In function 'nvml_check_vram':
11.85 gpu_info_nvml.c:158:20: warning: format '%ld' expects argument of type 'long int', but argument 4 has type 'long long unsigned int' [-Wformat=]
11.85 158 | LOG(h.verbose, "[%d] CUDA totalMem %ld\n", i, memInfo.total);
11.85 | ^~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~
11.85 | |
11.85 | long long unsigned int
11.85 gpu_info.h:33:23: note: in definition of macro 'LOG'
11.85 33 | fprintf(stderr, __VA_ARGS__); \
11.85 | ^~~~~~~~~~~
11.85 gpu_info_nvml.c:158:42: note: format string is defined here
11.85 158 | LOG(h.verbose, "[%d] CUDA totalMem %ld\n", i, memInfo.total);
11.85 | ~~^
11.85 | |
11.85 | long int
11.85 | %lld
11.85 In file included from gpu_info_nvml.h:4,
11.85 from gpu_info_nvml.c:5:
11.85 gpu_info_nvml.c:159:20: warning: format '%ld' expects argument of type 'long int', but argument 4 has type 'long long unsigned int' [-Wformat=]
11.85 159 | LOG(h.verbose, "[%d] CUDA freeMem %ld\n", i, memInfo.free);
11.85 | ^~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~
11.85 | |
11.85 | long long unsigned int
11.85 gpu_info.h:33:23: note: in definition of macro 'LOG'
11.85 33 | fprintf(stderr, __VA_ARGS__); \
11.85 | ^~~~~~~~~~~
11.85 gpu_info_nvml.c:159:41: note: format string is defined here
11.85 159 | LOG(h.verbose, "[%d] CUDA freeMem %ld\n", i, memInfo.free);
11.85 | ~~^
11.85 | |
11.85 | long int
11.85 | %lld
11.85 # github.com/jmorganca/ollama/gpu
11.85 gpu/gpu.go:91:15: undefined: AssetsDir


Dockerfile2:27

25 | WORKDIR /go/src/github.com/jmorganca/ollama
26 | RUN go generate ./...
27 | >>> RUN go build .
28 |
29 | # Runtime stages

ERROR: failed to solve: process "/bin/sh -c go build ." did not complete successfully: exit code: 1
```

The Dockerfile I use: [Dockerfile2.txt](https://github.com/ollama/ollama/files/14732050/Dockerfile2.txt)

@remy415, could you please have a look at my error message and give me a hint? Thank you!


@hangxingliu commented on GitHub (Mar 23, 2024):

Hi @UserName-wang. I guess your issue is caused by @remy415 still developing on his main branch (he may have forgotten to push the changes for importing the `assets` package).

I have been running @remy415's great fork of ollama on my Jetson AGX Orin with L4T 36.2.0 (Jetpack 6.0 DP) for a few days. It works great with `llama2:13b-chat-fp16` and `gemma:7b-instruct-fp16`, and it is very fast for daily usage. The ollama I built is based on this commit:
https://github.com/remy415/ollama/tree/23204a77d27cfdfbf80e0bdaf96116fcf1255039
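
If you want to reproduce that known-good build rather than track the moving main branch, you can pin to the same commit (a sketch, assuming the commit is still reachable in the fork):

```
git clone https://github.com/remy415/ollama
cd ollama
git checkout 23204a77d27cfdfbf80e0bdaf96116fcf1255039
go generate ./... && go build .
```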


@remy415 commented on GitHub (Mar 23, 2024):

@UserName-wang @hangxingliu the fork is currently broken due to an incomplete merge. I’m working with @dhiltgen to get it fixed.


@remy415 commented on GitHub (Mar 23, 2024):

@UserName-wang @hangxingliu
I have put in a temp fix here. I am not able to run a test build right now.

Try this branch [here](https://github.com/remy415/ollama/tree/tempfix).


@UserName-wang commented on GitHub (Mar 24, 2024):

> @UserName-wang @hangxingliu I have put in a temp fix here. I am not able to run a test build right now.
>
> Try this branch [here](https://github.com/remy415/ollama/tree/tempfix)

@remy415, thank you a lot! It works now, and I'm sure ollama is running on the GPU. I tried gemma and llama2 and they run fast!
And thank you, @hangxingliu! I hope some day I can be as helpful as you and help someone else.

I used `go generate ./... && go build .` to build this package on the host (AGX Orin) and it works, but it failed to run in Docker. The build process is successful (and the GPU was detected), but running `ollama run gemma:2b` fails. The error message is attached:
[ollama_Docker_build_log.txt](https://github.com/ollama/ollama/files/14734170/ollama_Docker_build_log.txt)

Can you please have a look and give me some suggestions if you have time? Thank you!


@remy415 commented on GitHub (Mar 24, 2024):

@UserName-wang Just to confirm:

  1. You ran `go generate ./... && go build .` from your regular command line, and tried to run `ollama run gemma:2b` from a Docker container?
  2. You are running Jetson 6 with CUDA 12?

I'm going to look at the `gpu.go` file and see why it didn't print the path to the library it loads as well.
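
For anyone who wants to check that library lookup by hand, a minimal stand-alone probe is sketched below. The candidate paths are illustrative, and this is not the actual gpu.go logic, which searches a longer list of patterns:

```
/* Try to dlopen candidate GPU libraries and report what resolves.
 * Build with: gcc probe.c -o probe -ldl */
#include <stdio.h>
#include <dlfcn.h>

int main(void) {
    const char *candidates[] = {
        "libnvidia-ml.so",
        "libnvidia-ml.so.1",
        "/usr/local/cuda/lib64/libcudart.so",
    };
    for (int i = 0; i < 3; i++) {
        void *h = dlopen(candidates[i], RTLD_LAZY);
        if (h) {
            printf("loaded: %s\n", candidates[i]);
            dlclose(h);
        } else {
            printf("failed: %s (%s)\n", candidates[i], dlerror());
        }
    }
    return 0;
}
```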


@remy415 commented on GitHub (Mar 24, 2024):

@UserName-wang Just FYI, I haven't worked out running this in Docker containers, as the Jetson is a bit of an oddity there. I suggest you look at dustynv's Jetson containers for running GPU-accelerated stuff in Docker, as the container runtime needs a special configuration to work on Jetsons.
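
For context, the special configuration is typically the NVIDIA container runtime that ships with JetPack; containers need it to see the integrated GPU. A hedged sketch (the image tag here is illustrative):

```
# Per-run: pass the NVIDIA runtime explicitly
sudo docker run --runtime nvidia -it --rm nvcr.io/nvidia/l4t-base:r35.4.1

# Or make it the default for all containers in /etc/docker/daemon.json:
# {
#   "runtimes": { "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] } },
#   "default-runtime": "nvidia"
# }
```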


@UserName-wang commented on GitHub (Mar 24, 2024):

> @UserName-wang Just to confirm:
>
> 1. You ran `go generate ./... && go build .` from your regular command line, and tried to run `ollama run gemma:2b` from a Docker container?
> 2. You are running Jetson 6 with CUDA 12?
>
> I'm going to look at the gpu.go file and see why it didn't print the path to the library it loads as well.

Answer to question 1: yes! And Docker already detected CUDA.
Answer to question 2: yes!


@remy415 commented on GitHub (Mar 24, 2024):

@UserName-wang OK, so it runs outside of Docker but does not run inside of Docker? The error message in your log suggested it is missing a driver; that's why I asked if you had configured it for Docker by following NVidia's instructions for CUDA. I don't have Jetpack 6 installed yet; I was waiting for the official release.


@remy415 commented on GitHub (Mar 25, 2024):

Fixed with merge of #2279


@remy415 commented on GitHub (Apr 15, 2024):

@UserName-wang dusty-nv merged a PR with a container for Ollama. You can find it on his [github page](https://github.com/dusty-nv/jetson-containers).
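
A hedged usage sketch, based on the jetson-containers README conventions at the time (check the repo for current commands; `autotag` picks an image tag matching your local L4T version, and this assumes an `ollama` package is published for it):

```
git clone https://github.com/dusty-nv/jetson-containers
cd jetson-containers
./run.sh $(./autotag ollama)
```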


@UserName-wang commented on GitHub (Apr 20, 2024):

Yes! I tested it and it works! Thank you for your information!

Reference: github-starred/ollama#63180