[GH-ISSUE #3406] Official arm64 build does not work on Jetson Nano Orin #2097

Closed
opened 2026-04-12 12:20:24 -05:00 by GiteaMirror · 21 comments

Originally created by @gab0220 on GitHub (Mar 29, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3406

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Hello everyone, thank you for your work.

I'm using a Jetson Nano Orin. Following #3098, a few days ago I did a git checkout of the commit from #2279 and installed that version on my device. It worked.

Today I tried to:

  • Install v0.1.30 using this tutorial: https://github.com/ollama/ollama/blob/main/docs/tutorials/nvidia-jetson.md#running-ollama-on-nvidia-jetson-devices
  • Clean ollama list
  • Run ollama pull <model>
  • Run OLLAMA_DEBUG="1" ollama run <model>
    Output:
Error: Post "http://127.0.0.1:11434/api/chat": EOF

I also attach the output of journalctl -u ollama:

Mar 29 11:16:09 ubuntu ollama[4168]: time=2024-03-29T11:16:09.687+01:00 level=INFO source=gpu.go:115 msg="Detecting GPU type"
Mar 29 11:16:09 ubuntu ollama[4168]: time=2024-03-29T11:16:09.687+01:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library libcudart.so*"
Mar 29 11:16:09 ubuntu ollama[4168]: time=2024-03-29T11:16:09.692+01:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/tmp/ollama3349183846/runners/cuda_v11/libcudart.so.11.0 /usr/local/cuda/lib64/libcudart.so.12.2.140 /usr/local/cuda/targets/aarch64-linux/lib/libcudart.so.12.2.140 /usr/local/cuda-12/targets/aarch64-linux/lib/libcudart.so.12.2.140 /usr/local/cuda-12.2/targets/aarch64-linux/lib/libcudart.so.12.2.140]"
Mar 29 11:16:09 ubuntu ollama[4168]: time=2024-03-29T11:16:09.714+01:00 level=INFO source=gpu.go:120 msg="Nvidia GPU detected via cudart"
Mar 29 11:16:09 ubuntu ollama[4168]: time=2024-03-29T11:16:09.714+01:00 level=INFO source=cpu_common.go:18 msg="CPU does not have vector extensions"
Mar 29 11:16:09 ubuntu ollama[4168]: time=2024-03-29T11:16:09.801+01:00 level=INFO source=gpu.go:188 msg="[cudart] CUDART CUDA Compute Capability detected: 8.7"
Mar 29 11:16:17 ubuntu systemd[1]: Stopping Ollama Service...
Mar 29 11:16:17 ubuntu systemd[1]: ollama.service: Deactivated successfully.
Mar 29 11:16:17 ubuntu systemd[1]: Stopped Ollama Service.
Mar 29 11:16:17 ubuntu systemd[1]: ollama.service: Consumed 9.601s CPU time.

What did you expect to see?

I expected the model to run; instead, I can't use the model at all.

Steps to reproduce

No response

Are there any recent changes that introduced the issue?

No response

OS

Linux

Architecture

Other

Platform

No response

Ollama version

v0.1.30

GPU

Nvidia

GPU info

No response

CPU

No response

Other software

No response

GiteaMirror added the bug and nvidia labels 2026-04-12 12:20:24 -05:00

@remy415 commented on GitHub (Mar 30, 2024):

@gab0220 thank you for reporting this. The issue right now is that the OS Jetsons run on can't use the CUDA libraries bundled by the official build process. We're still trying to pinpoint the exact cause to see if there's a way to keep using the same process with minor adjustments.

You should be able to quickly build the binary on your Jetson. Note that it is no longer necessary to follow the referenced tutorial, though it should still work if you compile it yourself.

First, set up the environment variables:

export GOLANG_VERSION=1.21.3
export GO_ARCH=arm64
export CMAKE_VERSION=3.22.1
export LD_LIBRARY_PATH=/usr/local/cuda/lib:/usr/local/cuda/lib64/:/usr/local/cuda/include
export OLLAMA_SKIP_CPU_GENERATE="1"
export CGO_ENABLED="1"
export CMAKE_CUDA_ARCHITECTURES="72;87"

Ensure the required tools are installed:

sudo apt update && sudo apt install -y build-essential
curl -s -L https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz | sudo tar -zx -C /usr --strip-components 1
sudo rm -f /usr/local/bin/cmake && sudo update-alternatives --install /usr/local/bin/cmake cmake /usr/bin/cmake 30
curl -s -L https://dl.google.com/go/go${GOLANG_VERSION}.linux-${GO_ARCH}.tar.gz | sudo tar xz -C /usr/local
sudo ln -s /usr/local/go/bin/go /usr/local/bin/go
sudo ln -s /usr/local/go/bin/gofmt /usr/local/bin/gofmt

Clone the repo and build; ensure you first cd <project folder>:

git clone https://github.com/ollama/ollama.git && cd ollama
go clean
go generate ./... && go build .

This will compile the Ollama binary for your Jetson and save it to your current directory. Remove the old Ollama binary with sudo rm /usr/local/bin/ollama, then copy the new one with sudo cp ollama /usr/local/bin/ollama. You can then restart your Ollama service.

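For convenience, a sketch of those final steps as one block (the systemctl restart assumes Ollama was set up as the usual systemd service by the official installer):

sudo rm /usr/local/bin/ollama
sudo cp ollama /usr/local/bin/ollama
sudo systemctl restart ollama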

@dhiltgen commented on GitHub (Apr 12, 2024):

I've adjusted the behavior of the system with the upcoming 0.1.32 release so that we'll load the CUDA library from LD_LIBRARY_PATH before our bundled version, which should help mitigate this. As long as you include the CUDA lib dir in LD_LIBRARY_PATH for the ollama server, it should work. Ultimately I'd still like to get an older-glibc-based build setup defined that has a CUDA library that works on Jetson, so I'll keep this issue open for now.
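
For anyone running the server under systemd, a minimal sketch of one way to pass the CUDA lib dir through (the /usr/local/cuda/lib64 path is an assumption; adjust it to wherever JetPack installed CUDA on your system):

sudo systemctl edit ollama

Then add the following to the override file:

[Service]
Environment="LD_LIBRARY_PATH=/usr/local/cuda/lib64"

Then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama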


@CesarCalvoCobo commented on GitHub (Apr 18, 2024):

Hi, thanks again to all for your work.

I am trying to compile the new version and always getting the same error:
/usr/local/go/pkg/tool/linux_arm64/link: running gcc failed: exit status 1
/usr/bin/ld: cannot find ollama/llm/build/linux/arm64_static/libllama.a: No such file or directory

I also tried installing the bundled version directly, including LD_LIBRARY_PATH; it runs, but it does not load the models.


@remy415 commented on GitHub (Apr 18, 2024):

@CesarCalvoCobo are you setting OLLAMA_SKIP_CPU_GENERATE=1? If so, you should set it to OLLAMA_SKIP_CPU_GENERATE="". I've submitted a PR to fix this, but in the meantime you need to build the CPU runners as well, and make sure you also don't set OLLAMA_CPU_TARGET.

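Concretely, a sketch of the corrected rebuild (assuming the environment variables from the earlier build comment are still exported in the same shell):

export OLLAMA_SKIP_CPU_GENERATE=""
unset OLLAMA_CPU_TARGET
go generate ./... && go build .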

@remy415 commented on GitHub (Apr 19, 2024):

@CesarCalvoCobo Okay, my PR got merged, so you should be able to just pull the latest ollama repo and run the compile again.


@CesarCalvoCobo commented on GitHub (Apr 19, 2024):

Thank you so much @remy415 - I compiled it successfully now.


@remy415 commented on GitHub (May 2, 2024):

@dhiltgen yeah, everything has been working well as of a couple of weeks ago.


@dhiltgen commented on GitHub (May 21, 2024):

Sounds like we can close this as resolved. Please speak up if you have any lingering issues on Jetsons.


@wilbert-vb commented on GitHub (Sep 12, 2024):

I have followed the instructions in: https://github.com/ollama/ollama/issues/3406#issuecomment-2028118618

The build finished successfully.

When running ollama run <model> I get the following error:

Error: llama runner process has terminated: CUDA error: the resource allocation failed
  current device: 0, in function cublas_handle at /home/wilbertvanbakel/ollama/llm/llama.cpp/ggml/src/ggml-cuda/common.cuh:644
  cublasCreate_v2(&cublas_handles[device])
/home/wilbertvanbakel/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:102: CUDA error

How would I solve this?


@soulisalmed commented on GitHub (Sep 12, 2024):

Yes, same error here on a Jetson AGX Orin 64GB.


@remy415 commented on GitHub (Sep 12, 2024):

@wilbert-vb @soulisalmed
When reporting issues it's important to share environment details so the issue can be assessed properly. Please provide the operating system (JetPack release and Linux distribution version) and the relevant software versions: GCC, the JetPack-native CUDA version, the installed CUDA version, the Go version, etc.

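A quick way to gather most of this on a Jetson (a sketch; jetson_release comes from the jetson-stats package, and nvcc may need /usr/local/cuda/bin on your PATH):

jetson_release
gcc --version | head -n1
go version
nvcc --version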

@wilbert-vb commented on GitHub (Sep 12, 2024):

@wilbert-vb @soulisalmed When reporting issues it's important to share environment details so the issue can be assessed properly. Please provide the operating system (JetPack release and Linux distribution version) and the relevant software versions: GCC, the JetPack-native CUDA version, the installed CUDA version, the Go version, etc.

Hardware:

NVidia Jetson Orin NX 16GB, Carrier Board: ReComputer J401

Jetson_release:

Model: NVIDIA Jetson Orin NX Engineering Reference Developer Kit - Jetpack 6.0 [L4T 36.3.0]
NV Power Mode[1]: 10W
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:

  • P-Number: p3767-0000
  • Module: NVIDIA Jetson Orin NX (16GB ram)

Platform:

  • Distribution: Ubuntu 22.04 Jammy Jellyfish
  • Release: 5.15.136-tegra

jtop:

  • Version: 4.2.9
  • Service: Active

Libraries:

  • CUDA: 12.2.140
  • cuDNN: 8.9.4.25
  • TensorRT: Not installed
  • VPI: 3.1.5
  • Vulkan: 1.3.204
  • OpenCV: 4.8.0 - with CUDA: NO

gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
go version go1.21.3 linux/arm64


@remy415 commented on GitHub (Sep 12, 2024):

@wilbert-vb what is the size (GB) of the model you are trying to use?

A quick search suggests it's related to device OOM (out of memory).

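One way to check for memory pressure while the model loads, since Jetsons share RAM between the CPU and GPU (tegrastats ships with L4T; this is just a sketch):

free -h        # system RAM; shared with the GPU on Jetson
tegrastats     # live RAM/GPU figures while the model loads (Ctrl-C to stop)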

@wilbert-vb commented on GitHub (Sep 12, 2024):

@wilbert-vb what is the size (GB) of the model you are trying to use?

A quick search suggests it's related to device OOM (out of memory).

Not sure that I understand your question.
Memory is 16GB
Storage is 500GB

Sorry:

mistral:latest f974a74358d6 4.1 GB 2 weeks ago
smollm:latest 95f6557a0f0f 990 MB 2 weeks ago
phi3.5:latest 3b387c8dd9b7 2.2 GB 2 weeks ago
gemma2:latest ff02c3702f32 5.4 GB 4 weeks ago
llama3.1:latest c4a76fe0c601 4.9 GB 4 weeks ago
openchat:latest 537a4e03b649 4.1 GB 4 weeks ago
gemma2:2b 8ccf136fdd52 1.6 GB 5 weeks ago
qwen2:latest e0d4e1163c58 4.4 GB 7 weeks ago


@remy415 commented on GitHub (Sep 12, 2024):

Did the same error occur when you use smollm or gemma2 2b?


@remy415 commented on GitHub (Sep 12, 2024):

Also please run it with debug enabled:

OLLAMA_DEBUG="1" ollama run <model>


@wilbert-vb commented on GitHub (Sep 13, 2024):

Also please run it with debug enabled:

OLLAMA_DEBUG="1" ollama run <model>

[Screenshot attached: 2024-09-12 at 20 28 03]

@wilbert-vb commented on GitHub (Sep 13, 2024):

Did the same error occur when you use smollm or gemma2 2b?

Yes and yes


@soulisalmed commented on GitHub (Sep 13, 2024):

@wilbert-vb I managed to get ollama run llama3.1 to load the model and generate output.
Before compiling ollama using https://github.com/ollama/ollama/issues/3406#issuecomment-2028118618, I had tried to install it using curl -fsSL https://ollama.com/install.sh | sh.
The normal installation process copies some generic libraries into /usr/local/lib/ollama:

user@ubuntu:/usr/local/lib/ollama$ ls -lah
total 872M
drwxr-xr-x 2 utilisateur utilisateur 4.0K Sep  8 10:14 .
drwxr-xr-x 5 utilisateur utilisateur 4.0K Sep 13 09:03 ..
lrwxrwxrwx 1 utilisateur utilisateur   17 Feb 28  2024 libcublasLt.so -> libcublasLt.so.12
lrwxrwxrwx 1 utilisateur utilisateur   25 May  4  2021 libcublasLt.so.11 -> libcublasLt.so.11.5.1.109
-rwxr-xr-x 1 utilisateur utilisateur 235M May  4  2021 libcublasLt.so.11.5.1.109
lrwxrwxrwx 1 utilisateur utilisateur   26 Feb 28  2024 libcublasLt.so.12 -> ./libcublasLt.so.12.4.2.65
-rwxr-xr-x 1 utilisateur utilisateur 406M Feb 28  2024 libcublasLt.so.12.4.2.65
lrwxrwxrwx 1 utilisateur utilisateur   15 Feb 28  2024 libcublas.so -> libcublas.so.12
lrwxrwxrwx 1 utilisateur utilisateur   23 May  4  2021 libcublas.so.11 -> libcublas.so.11.5.1.109
-rwxr-xr-x 1 utilisateur utilisateur 121M May  4  2021 libcublas.so.11.5.1.109
lrwxrwxrwx 1 utilisateur utilisateur   24 Feb 28  2024 libcublas.so.12 -> ./libcublas.so.12.4.2.65
-rwxr-xr-x 1 utilisateur utilisateur 111M Feb 28  2024 libcublas.so.12.4.2.65
lrwxrwxrwx 1 utilisateur utilisateur   15 Feb 28  2024 libcudart.so -> libcudart.so.12
lrwxrwxrwx 1 utilisateur utilisateur   21 May  4  2021 libcudart.so.11.0 -> libcudart.so.11.3.109
-rwxr-xr-x 1 utilisateur utilisateur 624K May  4  2021 libcudart.so.11.3.109
lrwxrwxrwx 1 utilisateur utilisateur   20 Feb 28  2024 libcudart.so.12 -> libcudart.so.12.4.99
-rwxr-xr-x 1 utilisateur utilisateur 680K Feb 28  2024 libcudart.so.12.4.99

Those are not compatible with the JetPack 6 versions of CUDA, cuBLAS, etc.

It seems the installed ollama binary uses those in priority, before the ones from LD_LIBRARY_PATH=/usr/local/cuda/lib:/usr/local/cuda/lib64/:/usr/local/cuda/include.

After renaming the directory or deleting it, it works now:

sudo mv /usr/local/lib/ollama /usr/local/lib/ollama_stock

or

sudo rm -r /usr/local/lib/ollama
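
To confirm which CUDA libraries the runner actually resolves after the change, something like the following should work (the /tmp/ollama* runner path varies per run, so this is a sketch rather than exact commands):

ldd /tmp/ollama*/runners/cuda_v12/ollama_llama_server | grep -Ei 'cudart|cublas'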
Logs:
Sep 13 09:18:56 ubuntu systemd[1]: Started Ollama Service.
Sep 13 09:18:56 ubuntu ollama[17505]: 2024/09/13 09:18:56 routes.go:1151: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.859+02:00 level=INFO source=images.go:753 msg="total blobs: 5"
Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.859+02:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
Sep 13 09:18:56 ubuntu ollama[17505]:  - using env:        export GIN_MODE=release
Sep 13 09:18:56 ubuntu ollama[17505]:  - using code:        gin.SetMode(gin.ReleaseMode)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /api/embed                --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.860+02:00 level=INFO source=routes.go:1198 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.860+02:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1163128731/runners
Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.860+02:00 level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v12 file=build/linux/arm64/cuda_v12/bin/libggml.so.gz
Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.860+02:00 level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v12 file=build/linux/arm64/cuda_v12/bin/libllama.so.gz
Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.860+02:00 level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v12 file=build/linux/arm64/cuda_v12/bin/ollama_llama_server.gz
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1163128731/runners/cuda_v12/ollama_llama_server
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cuda_v12]"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=INFO source=gpu.go:200 msg="looking for compatible GPUs"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=WARN source=gpu.go:669 msg="unable to locate gpu dependency libraries"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=DEBUG source=gpu.go:86 msg="searching for GPU discovery libraries for NVIDIA"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.285+02:00 level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=libcuda.so*
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.285+02:00 level=WARN source=gpu.go:669 msg="unable to locate gpu dependency libraries"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.285+02:00 level=DEBUG source=gpu.go:491 msg="gpu library search" globs="[libcuda.so* /libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.291+02:00 level=DEBUG source=gpu.go:525 msg="discovered GPU libraries" paths=[/usr/lib/aarch64-linux-gnu/nvidia/libcuda.so.1.1]
Sep 13 09:19:01 ubuntu ollama[17505]: CUDA driver version: 12.2
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.424+02:00 level=DEBUG source=gpu.go:119 msg="detected GPUs" count=1 library=/usr/lib/aarch64-linux-gnu/nvidia/libcuda.so.1.1
Sep 13 09:19:01 ubuntu ollama[17505]: [GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce] CUDA totalMem 62841 mb
Sep 13 09:19:01 ubuntu ollama[17505]: [GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce] CUDA freeMem 52792 mb
Sep 13 09:19:01 ubuntu ollama[17505]: [GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce] Compute Capability 8.7
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.543+02:00 level=DEBUG source=amd_linux.go:376 msg="amdgpu driver not detected /sys/module/amdgpu"
Sep 13 09:19:01 ubuntu ollama[17505]: releasing cuda driver library
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.543+02:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce library=cuda variant=jetpack6 compute=8.7 driver=12.2 name=Orin total="61.4 GiB" available="51.6 GiB"
Sep 13 09:19:05 ubuntu ollama[17505]: [GIN] 2024/09/13 - 09:19:05 | 200 |     104.503µs |       127.0.0.1 | HEAD     "/"
Sep 13 09:19:05 ubuntu ollama[17505]: [GIN] 2024/09/13 - 09:19:05 | 200 |   30.521045ms |       127.0.0.1 | POST     "/api/show"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.309+02:00 level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="61.4 GiB" before.free="51.8 GiB" before.free_swap="30.7 GiB" now.total="61.4 GiB" now.free="51.8 GiB" now.free_swap="30.7 GiB"
Sep 13 09:19:05 ubuntu ollama[17505]: CUDA driver version: 12.2
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.559+02:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce name=Orin overhead="0 B" before.total="61.4 GiB" before.free="51.6 GiB" now.total="61.4 GiB" now.free="51.5 GiB" now.used="9.8 GiB"
Sep 13 09:19:05 ubuntu ollama[17505]: releasing cuda driver library
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.560+02:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x7e6d00 gpu_count=1
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.657+02:00 level=DEBUG source=sched.go:224 msg="loading first model" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.658+02:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[51.5 GiB]"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.660+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce parallel=4 available=55335387136 required="6.2 GiB"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.661+02:00 level=INFO source=server.go:101 msg="system memory" total="61.4 GiB" free="51.8 GiB" free_swap="30.7 GiB"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.661+02:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[51.5 GiB]"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.662+02:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[51.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.663+02:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1163128731/runners/cuda_v12/ollama_llama_server
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.663+02:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1163128731/runners/cuda_v12/ollama_llama_server
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.668+02:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama1163128731/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --verbose --parallel 4 --port 44895"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.668+02:00 level=DEBUG source=server.go:408 msg=subprocess environment="[PATH=/home/utilisateur/go/bin:/usr/local/go/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin LD_LIBRARY_PATH=/tmp/ollama1163128731/runners/cuda_v12 CUDA_VISIBLE_DEVICES=GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce]"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.670+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.670+02:00 level=INFO source=server.go:590 msg="waiting for llama runner to start responding"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.672+02:00 level=INFO source=server.go:624 msg="waiting for server to become available" status="llm server error"
Sep 13 09:19:05 ubuntu ollama[17533]: INFO [main] build info | build=3661 commit="8962422b" tid="281472762038336" timestamp=1726211945
Sep 13 09:19:05 ubuntu ollama[17533]: INFO [main] system info | n_threads=8 n_threads_batch=8 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="281472762038336" timestamp=1726211945 total_threads=8
Sep 13 09:19:05 ubuntu ollama[17533]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="7" port="44895" tid="281472762038336" timestamp=1726211945
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest))
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv   5:                         general.size_label str              = 8B
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv   6:                            general.license str              = llama3.1
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv   9:                          llama.block_count u32              = 32
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  17:                          general.file_type u32              = 2
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  28:               general.quantization_version u32              = 2
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - type  f32:   66 tensors
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - type q4_0:  225 tensors
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - type q6_K:    1 tensors
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.924+02:00 level=INFO source=server.go:624 msg="waiting for server to become available" status="llm server loading model"
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_vocab: special tokens cache size = 256
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_vocab: token to piece cache size = 0.7999 MB
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: arch             = llama
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: vocab type       = BPE
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_vocab          = 128256
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_merges         = 280147
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: vocab_only       = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_ctx_train      = 131072
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_embd           = 4096
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_layer          = 32
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_head           = 32
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_head_kv        = 8
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_rot            = 128
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_swa            = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_embd_head_k    = 128
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_embd_head_v    = 128
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_gqa            = 4
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_embd_k_gqa     = 1024
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_embd_v_gqa     = 1024
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_ff             = 14336
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_expert         = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_expert_used    = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: causal attn      = 1
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: pooling type     = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: rope type        = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: rope scaling     = linear
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: freq_base_train  = 500000.0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: freq_scale_train = 1
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: rope_finetuned   = unknown
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: ssm_d_conv       = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: ssm_d_inner      = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: ssm_d_state      = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: ssm_dt_b_c_rms   = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: model type       = 8B
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: model ftype      = Q4_0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: model params     = 8.03 B
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW)
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: LF token         = 128 'Ä'
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: max token length = 256
Sep 13 09:19:06 ubuntu ollama[17505]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 13 09:19:06 ubuntu ollama[17505]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 13 09:19:06 ubuntu ollama[17505]: ggml_cuda_init: found 1 CUDA devices:
Sep 13 09:19:06 ubuntu ollama[17505]:   Device 0: Orin, compute capability 8.7, VMM: yes
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_tensors: ggml ctx size =    0.27 MiB
Sep 13 09:19:07 ubuntu ollama[17505]: llm_load_tensors: offloading 32 repeating layers to GPU
Sep 13 09:19:07 ubuntu ollama[17505]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 13 09:19:07 ubuntu ollama[17505]: llm_load_tensors: offloaded 33/33 layers to GPU
Sep 13 09:19:07 ubuntu ollama[17505]: llm_load_tensors:        CPU buffer size =   281.81 MiB
Sep 13 09:19:07 ubuntu ollama[17505]: llm_load_tensors:      CUDA0 buffer size =  4156.00 MiB
Sep 13 09:19:07 ubuntu ollama[17505]: time=2024-09-13T09:19:07.182+02:00 level=DEBUG source=server.go:635 msg="model load progress 0.06"
Sep 13 09:19:07 ubuntu ollama[17505]: time=2024-09-13T09:19:07.433+02:00 level=DEBUG source=server.go:635 msg="model load progress 0.44"
Sep 13 09:19:07 ubuntu ollama[17505]: time=2024-09-13T09:19:07.684+02:00 level=DEBUG source=server.go:635 msg="model load progress 0.76"
Sep 13 09:19:07 ubuntu ollama[17505]: time=2024-09-13T09:19:07.936+02:00 level=DEBUG source=server.go:635 msg="model load progress 0.99"
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: n_ctx      = 8192
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: n_batch    = 512
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: n_ubatch   = 512
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: flash_attn = 0
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: freq_base  = 500000.0
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: freq_scale = 1
Sep 13 09:19:08 ubuntu ollama[17505]: llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model:  CUDA_Host  output buffer size =     2.02 MiB
Sep 13 09:19:08 ubuntu ollama[17505]: time=2024-09-13T09:19:08.187+02:00 level=DEBUG source=server.go:635 msg="model load progress 1.00"
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: graph nodes  = 1030
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: graph splits = 2
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [initialize] initializing slots | n_slots=4 tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=0 tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=1 tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=2 tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=3 tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: INFO [main] model loaded | tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=0 tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17505]: time=2024-09-13T09:19:08.450+02:00 level=INFO source=server.go:629 msg="llama runner started in 2.78 seconds"
Sep 13 09:19:08 ubuntu ollama[17505]: time=2024-09-13T09:19:08.450+02:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
Sep 13 09:19:08 ubuntu ollama[17505]: [GIN] 2024/09/13 - 09:19:08 | 200 |  3.172840134s |       127.0.0.1 | POST     "/api/generate"
Sep 13 09:19:08 ubuntu ollama[17505]: time=2024-09-13T09:19:08.451+02:00 level=DEBUG source=sched.go:466 msg="context for request finished"
Sep 13 09:19:08 ubuntu ollama[17505]: time=2024-09-13T09:19:08.451+02:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe duration=5m0s
Sep 13 09:19:08 ubuntu ollama[17505]: time=2024-09-13T09:19:08.451+02:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe refCount=0
Sep 13 09:19:14 ubuntu ollama[17505]: time=2024-09-13T09:19:14.899+02:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
Sep 13 09:19:14 ubuntu ollama[17533]: DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1 tid="281472762038336" timestamp=1726211954
Sep 13 09:19:14 ubuntu ollama[17505]: time=2024-09-13T09:19:14.906+02:00 level=DEBUG source=routes.go:1415 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
Sep 13 09:19:14 ubuntu ollama[17533]: DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=2 tid="281472762038336" timestamp=1726211954
Sep 13 09:19:14 ubuntu ollama[17533]: DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=3 tid="281472762038336" timestamp=1726211954
Sep 13 09:19:14 ubuntu ollama[17533]: DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=11 slot_id=0 task_id=3 tid="281472762038336" timestamp=1726211954
Sep 13 09:19:14 ubuntu ollama[17533]: DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=3 tid="281472762038336" timestamp=1726211954
Sep 13 09:19:16 ubuntu ollama[17533]: DEBUG [print_timings] prompt eval time     =     385.58 ms /    11 tokens (   35.05 ms per token,    28.53 tokens per second) | n_prompt_tokens_processed=11 n_tokens_second=28.528450645780385 slot_id=0 t_prompt_processing=385.58 t_token=35.052727272727275 task_id=3 tid="281472762038336" timestamp=1726211956
Sep 13 09:19:16 ubuntu ollama[17533]: DEBUG [print_timings] generation eval time =     947.29 ms /    10 runs   (   94.73 ms per token,    10.56 tokens per second) | n_decoded=10 n_tokens_second=10.55638481822961 slot_id=0 t_token=94.7294 t_token_generation=947.294 task_id=3 tid="281472762038336" timestamp=1726211956
Sep 13 09:19:16 ubuntu ollama[17533]: DEBUG [print_timings]           total time =    1332.87 ms | slot_id=0 t_prompt_processing=385.58 t_token_generation=947.294 t_total=1332.874 task_id=3 tid="281472762038336" timestamp=1726211956
Sep 13 09:19:16 ubuntu ollama[17533]: DEBUG [update_slots] slot released | n_cache_tokens=21 n_ctx=8192 n_past=20 n_system_tokens=0 slot_id=0 task_id=3 tid="281472762038336" timestamp=1726211956 truncated=false
Sep 13 09:19:16 ubuntu ollama[17533]: DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=46016 status=200 tid="281472054396992" timestamp=1726211956
Sep 13 09:19:16 ubuntu ollama[17505]: [GIN] 2024/09/13 - 09:19:16 | 200 |  1.460359687s |       127.0.0.1 | POST     "/api/chat"
Sep 13 09:19:16 ubuntu ollama[17505]: time=2024-09-13T09:19:16.287+02:00 level=DEBUG source=sched.go:407 msg="context for request finished"
Sep 13 09:19:16 ubuntu ollama[17505]: time=2024-09-13T09:19:16.288+02:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe duration=5m0s
Sep 13 09:19:16 ubuntu ollama[17505]: time=2024-09-13T09:19:16.288+02:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe refCount=0

Details about my config:
Model: NVIDIA Jetson AGX Orin Developer Kit - Jetpack 6.0 [L4T 36.3.0]
NV Power Mode[2]: MODE_30W
Hardware:

  • P-Number: p3701-0005
  • Module: NVIDIA Jetson AGX Orin (64GB ram)

Platform:

  • Distribution: Ubuntu 22.04 Jammy Jellyfish
  • Release: 5.15.136-tegra

Libraries:

  • CUDA: 12.2.140
  • cuDNN: 8.9.4.25
  • TensorRT: 8.6.2.3
  • VPI: 3.1.5
  • Vulkan: 1.3.204
  • OpenCV: 4.8.0 - with CUDA: NO
@remy415 commented on GitHub (Sep 13, 2024):

@soulisalmed thank you, yes, that has been an ongoing issue: generic CUDA libraries that work on most systems don't play well with the Jetson drivers. I would speculate that NVIDIA is working on getting the Jetson firmware to a state where it works with their common drivers for this very reason. @wilbert-vb, let us know if removing or renaming the library directory created by the install script works for you.

Alternatively, you can try the container approach on dustynv's GitHub page.
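
For reference, the workaround amounts to moving the bundled library directory aside and restarting the service. A minimal sketch, assuming Ollama runs under the `ollama` systemd unit shown in the logs (the rename command is from the comment above; the restart step is an assumption):

```
# Rename the generic CUDA libraries installed by install.sh so the
# JetPack libraries on the system paths are picked up instead
# (a rename is reversible, unlike rm -r):
sudo mv /usr/local/lib/ollama /usr/local/lib/ollama_stock

# Restart the service so the new library search order takes effect:
sudo systemctl restart ollama
```

A sketch of the container alternative appears after the next comment.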

<!-- gh-comment-id:2348777345 --> @remy415 commented on GitHub (Sep 13, 2024): @soulisalmed thank you, yes that has been an ongoing issue - generic CUDA graphics drivers that work in most systems don’t play well with the Jetson drivers. I would speculate that nvidia is working on getting the Jetson firmware to a state where it works with their common drivers for this very reason. @wilbert-vb let us know if the fix of removing/renaming the install script driver directory works for you. Alternatively you can try the container approach on dustynv’s GitHub page.
@wilbert-vb commented on GitHub (Sep 13, 2024):

> @soulisalmed thank you, yes, that has been an ongoing issue: generic CUDA libraries that work on most systems don't play well with the Jetson drivers. I would speculate that NVIDIA is working on getting the Jetson firmware to a state where it works with their common drivers for this very reason. @wilbert-vb, let us know if removing or renaming the library directory created by the install script works for you.
>
> Alternatively, you can try the container approach on dustynv's GitHub page.

I can confirm that, after renaming /usr/local/lib/ollama, Ollama loads the model and responds to prompts, and the GPU is utilized.

(Screenshot 2024-09-13 at 07:24:34: GPU utilization while Ollama responds to a prompt)
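
The same check can be done without a GUI. A sketch, assuming the runner process is named `ollama_llama_server` as in the logs above (tegrastats ships with L4T):

```
# Watch GPU load while a prompt is being generated:
sudo tegrastats

# Confirm the runner mapped the JetPack CUDA libraries; after the rename
# these paths should resolve under /usr/local/cuda rather than
# /usr/local/lib/ollama. pgrep -n picks the newest matching process:
sudo grep -E 'libcublas|libcudart' /proc/"$(pgrep -nf ollama_llama_server)"/maps
```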

Dustynv's container seems out of date, and he appears less motivated to keep up with the pace of Ollama's progress; he has even suggested using llama.cpp instead (see https://github.com/dusty-nv/jetson-containers/issues/585#issuecomment-2316016480).

Many thanks, @soulisalmed and team!
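
For anyone who still wants to try the container route remy415 mentioned, the documented entry point in dusty-nv's jetson-containers repo looks roughly like this (a sketch; `autotag` selects an image matching your L4T release, and whether an up-to-date Ollama image exists for JetPack 6 needs checking):

```
# Fetch dusty-nv's tooling; install.sh sets up the autotag helper:
git clone https://github.com/dusty-nv/jetson-containers
bash jetson-containers/install.sh

# Launch an Ollama container built for this device's L4T/JetPack release:
jetson-containers run --name ollama $(autotag ollama)
```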
