[GH-ISSUE #3406] Official arm64 build does not work on Jetson Nano Orin #48608


GiteaMirror · 2026-04-28T08:56:20-05:00

GiteaMirror commented

2026-04-28 08:56:20 -05:00

Originally created by @gab0220 on GitHub (Mar 29, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3406

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Hello everyone, thank you for your work.

I'm using a Jetson Nano Orin. Following #3098, a few days ago I did a git checkout of the #2279 commit and installed that version on my device. It works.

Today I tried to:

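* Install v0.1.30 using this tutorial: https://github.com/ollama/ollama/blob/main/docs/tutorials/nvidia-jetson.md#running-ollama-on-nvidia-jetson-devices
* Clear out the models shown by ollama list
* Run ollama pull <model>
* Run OLLAMA_DEBUG="1" ollama run <model>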
Output:

Error: Post "http://127.0.0.1:11434/api/chat": EOF

I also attach the output of journalctl -u ollama:

Mar 29 11:16:09 ubuntu ollama[4168]: time=2024-03-29T11:16:09.687+01:00 level=INFO source=gpu.go:115 msg="Detecting GPU type"
Mar 29 11:16:09 ubuntu ollama[4168]: time=2024-03-29T11:16:09.687+01:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library libcudart.so*"
Mar 29 11:16:09 ubuntu ollama[4168]: time=2024-03-29T11:16:09.692+01:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [/tmp/ollama3349183846/runners/cuda_v11/libcudart.so.11.0 /usr/local/cuda/lib64/libcudart.so.12.2.140 /usr/local/cuda/targets/aarch64-linux/lib/libcudart.so.12.2.140 /usr/local/cuda-12/targets/aarch64-linux/lib/libcudart.so.12.2.140 /usr/local/cuda-12.2/targets/aarch64-linux/lib/libcudart.so.12.2.140]"
Mar 29 11:16:09 ubuntu ollama[4168]: time=2024-03-29T11:16:09.714+01:00 level=INFO source=gpu.go:120 msg="Nvidia GPU detected via cudart"
Mar 29 11:16:09 ubuntu ollama[4168]: time=2024-03-29T11:16:09.714+01:00 level=INFO source=cpu_common.go:18 msg="CPU does not have vector extensions"
Mar 29 11:16:09 ubuntu ollama[4168]: time=2024-03-29T11:16:09.801+01:00 level=INFO source=gpu.go:188 msg="[cudart] CUDART CUDA Compute Capability detected: 8.7"
Mar 29 11:16:17 ubuntu systemd[1]: Stopping Ollama Service...
Mar 29 11:16:17 ubuntu systemd[1]: ollama.service: Deactivated successfully.
Mar 29 11:16:17 ubuntu systemd[1]: Stopped Ollama Service.
Mar 29 11:16:17 ubuntu systemd[1]: ollama.service: Consumed 9.601s CPU time.
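The "Discovered GPU libraries" line above lists the libcudart.so bundled under /tmp/ollama3349183846/runners/cuda_v11 ahead of the JetPack CUDA 12.2 copies. To pull that list out of a longer journal, a small helper like the one below may help. It is a sketch that assumes the log line format shown above (other Ollama versions may log differently), and list_discovered_libs is a hypothetical name, not part of Ollama.

```shell
# list_discovered_libs (hypothetical helper): read ollama server log lines on
# stdin and print each path from the first "Discovered GPU libraries: [...]"
# message, one per line, in the order they were logged.
list_discovered_libs() {
  grep -m1 'Discovered GPU libraries' |
    sed -e 's/.*Discovered GPU libraries: \[//' -e 's/]".*$//' |
    tr ' ' '\n'
}
```

Usage: journalctl -u ollama --no-pager | list_discovered_libs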

What did you expect to see?

So I can't use the model.

Steps to reproduce

No response

Are there any recent changes that introduced the issue?

No response

OS

Linux

Architecture

Other

Platform

No response

Ollama version

v0.1.30

GPU

Nvidia

GPU info

No response

CPU

No response

Other software

No response


GiteaMirror added the bug nvidia labels 2026-04-28 08:56:20 -05:00

GiteaMirror closed this issue

2026-04-28 08:56:21 -05:00

GiteaMirror commented

2026-04-28 08:56:22 -05:00

@remy415 commented on GitHub (Mar 30, 2024):

@gab0220 thank you for reporting this. The issue right now is that the OS Jetsons run on isn’t able to use the CUDA libraries bundled by the process used to compile the official binary. We’re still trying to pinpoint the exact cause to see if there’s a way to keep using the same process with minor adjustments.

You should be able to quickly build the binary on your Jetson. Note that it is no longer necessary to follow the referenced tutorial, though it should still work if you compile it yourself.

First, set up the environment variables:

export GOLANG_VERSION=1.21.3
export GO_ARCH=arm64
export CMAKE_VERSION=3.22.1
export LD_LIBRARY_PATH=/usr/local/cuda/lib:/usr/local/cuda/lib64/:/usr/local/cuda/include
export OLLAMA_SKIP_CPU_GENERATE="1"
export CGO_ENABLED="1"
export CMAKE_CUDA_ARCHITECTURES="72;87"
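CMAKE_CUDA_ARCHITECTURES takes compute capabilities without the dot, so "72;87" covers 7.2 (Xavier-generation Jetsons) and 8.7 (Orin, which the log in the issue reports as "Compute Capability detected: 8.7"). The sketch below checks that a device's compute capability is in the list before starting a long build; cuda_arch_from_cc and arch_in_list are hypothetical helper names.

```shell
# cuda_arch_from_cc (hypothetical helper): turn a compute capability such as
# "8.7", as printed in the ollama log, into the "87" form used by
# CMAKE_CUDA_ARCHITECTURES.
cuda_arch_from_cc() {
  printf '%s\n' "$1" | tr -d '.'
}

# arch_in_list (hypothetical helper): succeed if architecture $1 appears in
# the semicolon-separated list $2, e.g. "72;87".
arch_in_list() {
  case ";$2;" in
    *";$1;"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Usage: arch_in_list "$(cuda_arch_from_cc 8.7)" "$CMAKE_CUDA_ARCHITECTURES" && echo "Orin (8.7) is covered"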

Ensure the required tools are installed:

sudo apt update && sudo apt install -y build-essential
curl -s -L https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}-linux-$(uname -m).tar.gz | sudo tar -zx -C /usr --strip-components 1
sudo rm -f /usr/local/bin/cmake && sudo update-alternatives --install /usr/local/bin/cmake cmake /usr/bin/cmake 30
curl -s -L https://dl.google.com/go/go${GOLANG_VERSION}.linux-${GO_ARCH}.tar.gz | sudo tar xz -C /usr/local
sudo ln -s /usr/local/go/bin/go /usr/local/bin/go
sudo ln -s /usr/local/go/bin/gofmt /usr/local/bin/gofmt
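Because these steps put Go under /usr/local/go and CMake under /usr, it may be worth confirming that the tools on PATH are at least the versions exported above before building. A minimal sketch, assuming GNU sort -V is available (it is part of coreutils on Ubuntu); version_ge is a hypothetical helper.

```shell
# version_ge A B (hypothetical helper): succeed if version A >= version B.
# sort -V orders version strings naturally (1.9 < 1.21); if B sorts first
# (or A equals B), then A is at least B.
version_ge() {
  [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For example: version_ge "$(go env GOVERSION | sed 's/^go//')" "$GOLANG_VERSION" || echo "go is older than $GOLANG_VERSION", and likewise version_ge "$(cmake --version | head -n 1 | awk '{print $3}')" "$CMAKE_VERSION" for CMake.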

Clone the repo and build. Make sure you first cd <project folder>:

git clone https://github.com/ollama/ollama.git && cd ollama
go clean
go generate ./... && go build .

This will compile the Ollama binary for your Jetson and save it to your current directory. Remove the old Ollama binary with sudo rm /usr/local/bin/ollama, then copy the new one with sudo cp ollama /usr/local/bin/ollama. You can then restart your Ollama service.
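Removing the old binary first means a failed build can leave the service with no binary at all. A sketch of a safer variant, assuming coreutils install(1) is available: it only touches the installed copy when a built ollama binary actually exists. replace_binary is a hypothetical helper name.

```shell
# replace_binary SRC DEST (hypothetical helper): install a freshly built
# binary over DEST, but only if SRC exists and is executable, so a failed
# build never leaves DEST deleted.
replace_binary() {
  src=$1
  dest=$2
  if [ ! -x "$src" ]; then
    echo "replace_binary: $src is missing or not executable" >&2
    return 1
  fi
  # install(1) copies SRC to DEST and sets mode 0755 in one step.
  install -m 0755 "$src" "$dest"
}
```

Since sudo cannot call shell functions, run it from a root shell (sudo -s), or use the equivalent one-liner sudo install -m 0755 ollama /usr/local/bin/ollama. Then restart the service with sudo systemctl restart ollama (the unit appears as ollama.service in the journal output in the issue).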


GiteaMirror commented

2026-04-28 08:56:22 -05:00

@dhiltgen commented on GitHub (Apr 12, 2024):

I've adjusted the behavior with the upcoming 0.1.32 release so that we load the CUDA library from LD_LIBRARY_PATH before our bundled version, which should help mitigate this. As long as you include the CUDA lib dir in the LD_LIBRARY_PATH of the ollama server, it should work. Ultimately I'd still like to define a build setup based on an older glibc, with a CUDA library that works on Jetson, so I'll keep this issue open for now.
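For a systemd-managed server, one way to give the service that LD_LIBRARY_PATH is a drop-in file. A sketch, assuming the unit is ollama.service (as in the journal output in the issue) and that the JetPack CUDA libraries live in /usr/local/cuda/lib64 (where the first log found libcudart.so.12.2.140); write_ld_dropin and the file name ld-library-path.conf are hypothetical.

```shell
# write_ld_dropin DIR LIBDIR (hypothetical helper): write a systemd drop-in
# into DIR that sets LD_LIBRARY_PATH=LIBDIR for the service.
# DIR would normally be /etc/systemd/system/ollama.service.d
write_ld_dropin() {
  dir=$1
  libdir=$2
  mkdir -p "$dir" || return 1
  cat > "$dir/ld-library-path.conf" <<EOF
[Service]
Environment="LD_LIBRARY_PATH=$libdir"
EOF
}
```

As root: write_ld_dropin /etc/systemd/system/ollama.service.d /usr/local/cuda/lib64, then sudo systemctl daemon-reload && sudo systemctl restart ollama so systemd picks up the new drop-in.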


GiteaMirror commented

2026-04-28 08:56:22 -05:00

@CesarCalvoCobo commented on GitHub (Apr 18, 2024):

Hi, thanks again, all, for your work.

I am trying to compile the new version and always get the same error:
/usr/local/go/pkg/tool/linux_arm64/link: running gcc failed: exit status 1
/usr/bin/ld: cannot find ollama/llm/build/linux/arm64_static/libllama.a: No such file or directory

I also tried installing the bundled version directly with LD_LIBRARY_PATH set; it runs, but it does not load the models.


GiteaMirror commented

2026-04-28 08:56:23 -05:00

@remy415 commented on GitHub (Apr 18, 2024):

@CesarCalvoCobo are you setting OLLAMA_SKIP_CPU_GENERATE=1? If so, set it to OLLAMA_SKIP_CPU_GENERATE="" instead. I've submitted a PR to fix this, but in the meantime you need to compile the CPU build, and make sure you also don't set OLLAMA_CPU_TARGET.
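The workaround described above can be written as environment settings, together with a quick check for the archive the linker reported missing. A sketch: static_lib_present is a hypothetical helper, and its path is taken from the error message, assuming it is relative to the repository root (the ollama/ prefix in the message being the clone directory). Per the follow-up comment the fix was merged, so newer checkouts should not need this.

```shell
# Workaround from the comment above (before the fix was merged): do not skip
# the CPU build, and leave OLLAMA_CPU_TARGET unset.
export OLLAMA_SKIP_CPU_GENERATE=""
unset OLLAMA_CPU_TARGET

# static_lib_present [REPO] (hypothetical helper): succeed if the archive
# named in the linker error exists under REPO (default: current directory).
static_lib_present() {
  [ -f "${1:-.}/llm/build/linux/arm64_static/libllama.a" ]
}
```

After go generate ./..., running static_lib_present || echo "libllama.a missing, re-run go generate" from the repository root shows whether go build can be expected to link.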


GiteaMirror commented

2026-04-28 08:56:23 -05:00

@remy415 commented on GitHub (Apr 19, 2024):

@CesarCalvoCobo Okay, my PR got merged, so you should be able to just pull the latest ollama repo and run the compile again.


GiteaMirror commented

2026-04-28 08:56:23 -05:00

@CesarCalvoCobo commented on GitHub (Apr 19, 2024):

Thank you so much @remy415! I compiled it successfully now.


GiteaMirror commented

2026-04-28 08:56:24 -05:00

@remy415 commented on GitHub (May 2, 2024):

@dhiltgen yeah, everything has been working well for a couple of weeks now.


GiteaMirror commented

2026-04-28 08:56:25 -05:00

@dhiltgen commented on GitHub (May 21, 2024):

Sounds like we can close this as resolved. Please speak up if you have any lingering issues on Jetsons.


GiteaMirror commented

2026-04-28 08:56:26 -05:00

@wilbert-vb commented on GitHub (Sep 12, 2024):

I have followed the instructions in: https://github.com/ollama/ollama/issues/3406#issuecomment-2028118618

The build finished with success.

When running 'ollama run {model}', I get the following error:

Error: llama runner process has terminated: CUDA error: the resource allocation failed current device: 0, in function cublas_handle at /home/wilbertvanbakel/ollama/llm/llama.cpp/ggml/src/ggml-cuda/common.cuh:644 cublasCreate_v2(&cublas_handles[device]) /home/wilbertvanbakel/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:102: CUDA error

How would I solve this?


GiteaMirror commented

2026-04-28 08:56:26 -05:00

@soulisalmed commented on GitHub (Sep 12, 2024):

Yes, same error here on a Jetson AGX Orin 64GB.


GiteaMirror commented

2026-04-28 08:56:26 -05:00

@remy415 commented on GitHub (Sep 12, 2024):

@wilbert-vb @soulisalmed
When reporting issues, it’s important to share environment details so the issue can be assessed properly. Please provide the operating system (JetPack distribution and Linux distribution version) and the relevant software package versions: GCC compiler, native NVIDIA CUDA version, installed CUDA version, Golang version, etc.


GiteaMirror commented

2026-04-28 08:56:27 -05:00

@wilbert-vb commented on GitHub (Sep 12, 2024):

> @wilbert-vb @soulisalmed When reporting issues it’s important to share environment details in order to better assess the issue. Please provide operating system (Jetpack distribution and Linux distribution version), and relevant software package versions: GCC compiler, nVidia CUDA native version, CUDA installed version, Golang version, etc.

Hardware:

NVidia Jetson Orin NX 16GB, Carrier Board: ReComputer J401

Jetson_release:

Model: NVIDIA Jetson Orin NX Engineering Reference Developer Kit - Jetpack 6.0 [L4T 36.3.0]
NV Power Mode[1]: 10W
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:

P-Number: p3767-0000

Module: NVIDIA Jetson Orin NX (16GB ram)
Platform:

Distribution: Ubuntu 22.04 Jammy Jellyfish

Release: 5.15.136-tegra
jtop:

Version: 4.2.9
Service: Active
Libraries:

CUDA: 12.2.140

cuDNN: 8.9.4.25

TensorRT: Not installed

VPI: 3.1.5

Vulkan: 1.3.204

OpenCV: 4.8.0 - with CUDA: NO

gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
go version go1.21.3 linux/arm64


GiteaMirror commented

2026-04-28 08:56:27 -05:00

@remy415 commented on GitHub (Sep 12, 2024):

@wilbert-vb what is the size (GB) of the model you are trying to use?

A quick search suggests it’s related to the device running out of memory (OOM).


GiteaMirror commented

2026-04-28 08:56:28 -05:00

@wilbert-vb commented on GitHub (Sep 12, 2024):

> @wilbert-vb what is the size (Gb) of the model you are trying to use?
>
> quick search suggests it’s related to device OOM

Not sure that I understand your question.
Memory is 16GB
Storage is 500GB

Sorry:

mistral:latest f974a74358d6 4.1 GB 2 weeks ago
smollm:latest 95f6557a0f0f 990 MB 2 weeks ago
phi3.5:latest 3b387c8dd9b7 2.2 GB 2 weeks ago
gemma2:latest ff02c3702f32 5.4 GB 4 weeks ago
llama3.1:latest c4a76fe0c601 4.9 GB 4 weeks ago
openchat:latest 537a4e03b649 4.1 GB 4 weeks ago
gemma2:2b 8ccf136fdd52 1.6 GB 5 weeks ago
qwen2:latest e0d4e1163c58 4.4 GB 7 weeks ago


GiteaMirror commented

2026-04-28 08:56:28 -05:00

@remy415 commented on GitHub (Sep 12, 2024):

Did the same error occur when you use smollm or gemma2 2b?
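When ruling out out-of-memory errors, it helps to start with the smallest models. A sketch that orders ollama list output like the one above by size, assuming the column layout shown there (name, ID, size number, size unit, modified); smallest_models is a hypothetical helper, and the MB conversion is only used for ordering.

```shell
# smallest_models (hypothetical helper): read `ollama list` output on stdin
# and print model names ordered by size, smallest first. Lines whose fourth
# field is not GB or MB (such as a header) are skipped.
smallest_models() {
  awk '$4 == "GB" { print $3 * 1000, $1 }
       $4 == "MB" { print $3, $1 }' |
    LC_ALL=C sort -n |
    awk '{ print $2 }'
}
```

Usage: ollama list | smallest_models | head -n 2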


GiteaMirror commented

2026-04-28 08:56:28 -05:00

@remy415 commented on GitHub (Sep 12, 2024):

Also please run it with debug enabled:

OLLAMA_DEBUG="1" ollama run <model>


GiteaMirror commented

2026-04-28 08:56:28 -05:00

@wilbert-vb commented on GitHub (Sep 13, 2024):

> Also please run it with debug enabled:
>
> OLLAMA_DEBUG="1" ollama run <model>

(Attached screenshot: "Screenshot 2024-09-12 at 20 28 03")

GiteaMirror commented

2026-04-28 08:56:30 -05:00

@wilbert-vb commented on GitHub (Sep 13, 2024):

> Did the same error occur when you use smollm or gemma2 2b?

Yes and yes


GiteaMirror commented

2026-04-28 08:56:30 -05:00

@soulisalmed commented on GitHub (Sep 13, 2024):

@wilbert-vb I managed to get ollama run llama3.1 to load the model and generate output.
Before compiling ollama using https://github.com/ollama/ollama/issues/3406#issuecomment-2028118618, I had tried to install it with curl -fsSL https://ollama.com/install.sh | sh.
The normal installation process copies some generic libraries into /usr/local/lib/ollama:

user@ubuntu:/usr/local/lib/ollama$ ls -lah
total 872M
drwxr-xr-x 2 utilisateur utilisateur 4.0K Sep  8 10:14 .
drwxr-xr-x 5 utilisateur utilisateur 4.0K Sep 13 09:03 ..
lrwxrwxrwx 1 utilisateur utilisateur   17 Feb 28  2024 libcublasLt.so -> libcublasLt.so.12
lrwxrwxrwx 1 utilisateur utilisateur   25 May  4  2021 libcublasLt.so.11 -> libcublasLt.so.11.5.1.109
-rwxr-xr-x 1 utilisateur utilisateur 235M May  4  2021 libcublasLt.so.11.5.1.109
lrwxrwxrwx 1 utilisateur utilisateur   26 Feb 28  2024 libcublasLt.so.12 -> ./libcublasLt.so.12.4.2.65
-rwxr-xr-x 1 utilisateur utilisateur 406M Feb 28  2024 libcublasLt.so.12.4.2.65
lrwxrwxrwx 1 utilisateur utilisateur   15 Feb 28  2024 libcublas.so -> libcublas.so.12
lrwxrwxrwx 1 utilisateur utilisateur   23 May  4  2021 libcublas.so.11 -> libcublas.so.11.5.1.109
-rwxr-xr-x 1 utilisateur utilisateur 121M May  4  2021 libcublas.so.11.5.1.109
lrwxrwxrwx 1 utilisateur utilisateur   24 Feb 28  2024 libcublas.so.12 -> ./libcublas.so.12.4.2.65
-rwxr-xr-x 1 utilisateur utilisateur 111M Feb 28  2024 libcublas.so.12.4.2.65
lrwxrwxrwx 1 utilisateur utilisateur   15 Feb 28  2024 libcudart.so -> libcudart.so.12
lrwxrwxrwx 1 utilisateur utilisateur   21 May  4  2021 libcudart.so.11.0 -> libcudart.so.11.3.109
-rwxr-xr-x 1 utilisateur utilisateur 624K May  4  2021 libcudart.so.11.3.109
lrwxrwxrwx 1 utilisateur utilisateur   20 Feb 28  2024 libcudart.so.12 -> libcudart.so.12.4.99
-rwxr-xr-x 1 utilisateur utilisateur 680K Feb 28  2024 libcudart.so.12.4.99

These are not compatible with the JetPack 6 versions of CUDA, cuBLAS, etc.

It seems the compiled ollama binary loads these in preference to the ones in LD_LIBRARY_PATH=/usr/local/cuda/lib:/usr/local/cuda/lib64/:/usr/local/cuda/include.

After renaming or deleting that directory, it works now:

sudo mv /usr/local/lib/ollama /usr/local/lib/ollama_stock

or

sudo rm -r /usr/local/lib/ollama
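Before renaming the directory, a quick way to see whether the stock CUDA runtime and cuBLAS libraries are still there. A sketch; bundled_cuda_libs is a hypothetical helper that defaults to the /usr/local/lib/ollama path from the listing above.

```shell
# bundled_cuda_libs [DIR] (hypothetical helper): print the names of CUDA
# runtime and cuBLAS libraries found in DIR (default /usr/local/lib/ollama).
# Prints nothing if DIR does not exist or holds no such libraries.
bundled_cuda_libs() {
  dir=${1:-/usr/local/lib/ollama}
  [ -d "$dir" ] || return 0
  for f in "$dir"/libcudart.so* "$dir"/libcublas.so* "$dir"/libcublasLt.so*; do
    # Unmatched globs stay literal, so skip anything that does not exist.
    if [ -e "$f" ]; then
      basename "$f"
    fi
  done
}
```

Usage: if [ -n "$(bundled_cuda_libs)" ]; then sudo mv /usr/local/lib/ollama /usr/local/lib/ollama_stock; fi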

Logs:

Sep 13 09:18:56 ubuntu systemd[1]: Started Ollama Service.
Sep 13 09:18:56 ubuntu ollama[17505]: 2024/09/13 09:18:56 routes.go:1151: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.859+02:00 level=INFO source=images.go:753 msg="total blobs: 5"
Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.859+02:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
Sep 13 09:18:56 ubuntu ollama[17505]:  - using env:        export GIN_MODE=release
Sep 13 09:18:56 ubuntu ollama[17505]:  - using code:        gin.SetMode(gin.ReleaseMode)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /api/embed                --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.860+02:00 level=INFO source=routes.go:1198 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.860+02:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1163128731/runners
Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.860+02:00 level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v12 file=build/linux/arm64/cuda_v12/bin/libggml.so.gz
Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.860+02:00 level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v12 file=build/linux/arm64/cuda_v12/bin/libllama.so.gz
Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.860+02:00 level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v12 file=build/linux/arm64/cuda_v12/bin/ollama_llama_server.gz
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1163128731/runners/cuda_v12/ollama_llama_server
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cuda_v12]"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=INFO source=gpu.go:200 msg="looking for compatible GPUs"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=WARN source=gpu.go:669 msg="unable to locate gpu dependency libraries"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=DEBUG source=gpu.go:86 msg="searching for GPU discovery libraries for NVIDIA"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.285+02:00 level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=libcuda.so*
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.285+02:00 level=WARN source=gpu.go:669 msg="unable to locate gpu dependency libraries"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.285+02:00 level=DEBUG source=gpu.go:491 msg="gpu library search" globs="[libcuda.so* /libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.291+02:00 level=DEBUG source=gpu.go:525 msg="discovered GPU libraries" paths=[/usr/lib/aarch64-linux-gnu/nvidia/libcuda.so.1.1]
Sep 13 09:19:01 ubuntu ollama[17505]: CUDA driver version: 12.2
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.424+02:00 level=DEBUG source=gpu.go:119 msg="detected GPUs" count=1 library=/usr/lib/aarch64-linux-gnu/nvidia/libcuda.so.1.1
Sep 13 09:19:01 ubuntu ollama[17505]: [GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce] CUDA totalMem 62841 mb
Sep 13 09:19:01 ubuntu ollama[17505]: [GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce] CUDA freeMem 52792 mb
Sep 13 09:19:01 ubuntu ollama[17505]: [GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce] Compute Capability 8.7
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.543+02:00 level=DEBUG source=amd_linux.go:376 msg="amdgpu driver not detected /sys/module/amdgpu"
Sep 13 09:19:01 ubuntu ollama[17505]: releasing cuda driver library
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.543+02:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce library=cuda variant=jetpack6 compute=8.7 driver=12.2 name=Orin total="61.4 GiB" available="51.6 GiB"
Sep 13 09:19:05 ubuntu ollama[17505]: [GIN] 2024/09/13 - 09:19:05 | 200 |     104.503µs |       127.0.0.1 | HEAD     "/"
Sep 13 09:19:05 ubuntu ollama[17505]: [GIN] 2024/09/13 - 09:19:05 | 200 |   30.521045ms |       127.0.0.1 | POST     "/api/show"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.309+02:00 level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="61.4 GiB" before.free="51.8 GiB" before.free_swap="30.7 GiB" now.total="61.4 GiB" now.free="51.8 GiB" now.free_swap="30.7 GiB"
Sep 13 09:19:05 ubuntu ollama[17505]: CUDA driver version: 12.2
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.559+02:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce name=Orin overhead="0 B" before.total="61.4 GiB" before.free="51.6 GiB" now.total="61.4 GiB" now.free="51.5 GiB" now.used="9.8 GiB"
Sep 13 09:19:05 ubuntu ollama[17505]: releasing cuda driver library
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.560+02:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x7e6d00 gpu_count=1
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.657+02:00 level=DEBUG source=sched.go:224 msg="loading first model" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.658+02:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[51.5 GiB]"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.660+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce parallel=4 available=55335387136 required="6.2 GiB"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.661+02:00 level=INFO source=server.go:101 msg="system memory" total="61.4 GiB" free="51.8 GiB" free_swap="30.7 GiB"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.661+02:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[51.5 GiB]"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.662+02:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[51.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.663+02:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1163128731/runners/cuda_v12/ollama_llama_server
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.663+02:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1163128731/runners/cuda_v12/ollama_llama_server
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.668+02:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama1163128731/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --verbose --parallel 4 --port 44895"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.668+02:00 level=DEBUG source=server.go:408 msg=subprocess environment="[PATH=/home/utilisateur/go/bin:/usr/local/go/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin LD_LIBRARY_PATH=/tmp/ollama1163128731/runners/cuda_v12 CUDA_VISIBLE_DEVICES=GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce]"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.670+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.670+02:00 level=INFO source=server.go:590 msg="waiting for llama runner to start responding"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.672+02:00 level=INFO source=server.go:624 msg="waiting for server to become available" status="llm server error"
Sep 13 09:19:05 ubuntu ollama[17533]: INFO [main] build info | build=3661 commit="8962422b" tid="281472762038336" timestamp=1726211945
Sep 13 09:19:05 ubuntu ollama[17533]: INFO [main] system info | n_threads=8 n_threads_batch=8 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="281472762038336" timestamp=1726211945 total_threads=8
Sep 13 09:19:05 ubuntu ollama[17533]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="7" port="44895" tid="281472762038336" timestamp=1726211945
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest))
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv   1:                               general.type str              = model
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv   5:                         general.size_label str              = 8B
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv   6:                            general.license str              = llama3.1
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv   9:                          llama.block_count u32              = 32
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  17:                          general.file_type u32              = 2
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv  28:               general.quantization_version u32              = 2
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - type  f32:   66 tensors
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - type q4_0:  225 tensors
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - type q6_K:    1 tensors
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.924+02:00 level=INFO source=server.go:624 msg="waiting for server to become available" status="llm server loading model"
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_vocab: special tokens cache size = 256
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_vocab: token to piece cache size = 0.7999 MB
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: arch             = llama
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: vocab type       = BPE
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_vocab          = 128256
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_merges         = 280147
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: vocab_only       = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_ctx_train      = 131072
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_embd           = 4096
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_layer          = 32
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_head           = 32
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_head_kv        = 8
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_rot            = 128
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_swa            = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_embd_head_k    = 128
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_embd_head_v    = 128
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_gqa            = 4
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_embd_k_gqa     = 1024
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_embd_v_gqa     = 1024
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_ff             = 14336
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_expert         = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_expert_used    = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: causal attn      = 1
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: pooling type     = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: rope type        = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: rope scaling     = linear
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: freq_base_train  = 500000.0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: freq_scale_train = 1
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: rope_finetuned   = unknown
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: ssm_d_conv       = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: ssm_d_inner      = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: ssm_d_state      = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: ssm_dt_b_c_rms   = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: model type       = 8B
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: model ftype      = Q4_0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: model params     = 8.03 B
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW)
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: LF token         = 128 'Ä'
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: max token length = 256
Sep 13 09:19:06 ubuntu ollama[17505]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 13 09:19:06 ubuntu ollama[17505]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 13 09:19:06 ubuntu ollama[17505]: ggml_cuda_init: found 1 CUDA devices:
Sep 13 09:19:06 ubuntu ollama[17505]:   Device 0: Orin, compute capability 8.7, VMM: yes
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_tensors: ggml ctx size =    0.27 MiB
Sep 13 09:19:07 ubuntu ollama[17505]: llm_load_tensors: offloading 32 repeating layers to GPU
Sep 13 09:19:07 ubuntu ollama[17505]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 13 09:19:07 ubuntu ollama[17505]: llm_load_tensors: offloaded 33/33 layers to GPU
Sep 13 09:19:07 ubuntu ollama[17505]: llm_load_tensors:        CPU buffer size =   281.81 MiB
Sep 13 09:19:07 ubuntu ollama[17505]: llm_load_tensors:      CUDA0 buffer size =  4156.00 MiB
Sep 13 09:19:07 ubuntu ollama[17505]: time=2024-09-13T09:19:07.182+02:00 level=DEBUG source=server.go:635 msg="model load progress 0.06"
Sep 13 09:19:07 ubuntu ollama[17505]: time=2024-09-13T09:19:07.433+02:00 level=DEBUG source=server.go:635 msg="model load progress 0.44"
Sep 13 09:19:07 ubuntu ollama[17505]: time=2024-09-13T09:19:07.684+02:00 level=DEBUG source=server.go:635 msg="model load progress 0.76"
Sep 13 09:19:07 ubuntu ollama[17505]: time=2024-09-13T09:19:07.936+02:00 level=DEBUG source=server.go:635 msg="model load progress 0.99"
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: n_ctx      = 8192
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: n_batch    = 512
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: n_ubatch   = 512
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: flash_attn = 0
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: freq_base  = 500000.0
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: freq_scale = 1
Sep 13 09:19:08 ubuntu ollama[17505]: llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model:  CUDA_Host  output buffer size =     2.02 MiB
Sep 13 09:19:08 ubuntu ollama[17505]: time=2024-09-13T09:19:08.187+02:00 level=DEBUG source=server.go:635 msg="model load progress 1.00"
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: graph nodes  = 1030
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: graph splits = 2
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [initialize] initializing slots | n_slots=4 tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=0 tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=1 tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=2 tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=3 tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: INFO [main] model loaded | tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=0 tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17505]: time=2024-09-13T09:19:08.450+02:00 level=INFO source=server.go:629 msg="llama runner started in 2.78 seconds"
Sep 13 09:19:08 ubuntu ollama[17505]: time=2024-09-13T09:19:08.450+02:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
Sep 13 09:19:08 ubuntu ollama[17505]: [GIN] 2024/09/13 - 09:19:08 | 200 |  3.172840134s |       127.0.0.1 | POST     "/api/generate"
Sep 13 09:19:08 ubuntu ollama[17505]: time=2024-09-13T09:19:08.451+02:00 level=DEBUG source=sched.go:466 msg="context for request finished"
Sep 13 09:19:08 ubuntu ollama[17505]: time=2024-09-13T09:19:08.451+02:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe duration=5m0s
Sep 13 09:19:08 ubuntu ollama[17505]: time=2024-09-13T09:19:08.451+02:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe refCount=0
Sep 13 09:19:14 ubuntu ollama[17505]: time=2024-09-13T09:19:14.899+02:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
Sep 13 09:19:14 ubuntu ollama[17533]: DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1 tid="281472762038336" timestamp=1726211954
Sep 13 09:19:14 ubuntu ollama[17505]: time=2024-09-13T09:19:14.906+02:00 level=DEBUG source=routes.go:1415 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
Sep 13 09:19:14 ubuntu ollama[17533]: DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=2 tid="281472762038336" timestamp=1726211954
Sep 13 09:19:14 ubuntu ollama[17533]: DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=3 tid="281472762038336" timestamp=1726211954
Sep 13 09:19:14 ubuntu ollama[17533]: DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=11 slot_id=0 task_id=3 tid="281472762038336" timestamp=1726211954
Sep 13 09:19:14 ubuntu ollama[17533]: DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=3 tid="281472762038336" timestamp=1726211954
Sep 13 09:19:16 ubuntu ollama[17533]: DEBUG [print_timings] prompt eval time     =     385.58 ms /    11 tokens (   35.05 ms per token,    28.53 tokens per second) | n_prompt_tokens_processed=11 n_tokens_second=28.528450645780385 slot_id=0 t_prompt_processing=385.58 t_token=35.052727272727275 task_id=3 tid="281472762038336" timestamp=1726211956
Sep 13 09:19:16 ubuntu ollama[17533]: DEBUG [print_timings] generation eval time =     947.29 ms /    10 runs   (   94.73 ms per token,    10.56 tokens per second) | n_decoded=10 n_tokens_second=10.55638481822961 slot_id=0 t_token=94.7294 t_token_generation=947.294 task_id=3 tid="281472762038336" timestamp=1726211956
Sep 13 09:19:16 ubuntu ollama[17533]: DEBUG [print_timings]           total time =    1332.87 ms | slot_id=0 t_prompt_processing=385.58 t_token_generation=947.294 t_total=1332.874 task_id=3 tid="281472762038336" timestamp=1726211956
Sep 13 09:19:16 ubuntu ollama[17533]: DEBUG [update_slots] slot released | n_cache_tokens=21 n_ctx=8192 n_past=20 n_system_tokens=0 slot_id=0 task_id=3 tid="281472762038336" timestamp=1726211956 truncated=false
Sep 13 09:19:16 ubuntu ollama[17533]: DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=46016 status=200 tid="281472054396992" timestamp=1726211956
Sep 13 09:19:16 ubuntu ollama[17505]: [GIN] 2024/09/13 - 09:19:16 | 200 |  1.460359687s |       127.0.0.1 | POST     "/api/chat"
Sep 13 09:19:16 ubuntu ollama[17505]: time=2024-09-13T09:19:16.287+02:00 level=DEBUG source=sched.go:407 msg="context for request finished"
Sep 13 09:19:16 ubuntu ollama[17505]: time=2024-09-13T09:19:16.288+02:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe duration=5m0s
Sep 13 09:19:16 ubuntu ollama[17505]: time=2024-09-13T09:19:16.288+02:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe refCount=0
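The KV-cache figures in the log can be cross-checked against the model metadata printed by `llm_load_print_meta`. This worked arithmetic makes two assumptions. First, the context is split evenly across the parallel slots; the log shows `--parallel 4`, `--ctx-size 8192` and `n_ctx_slot=2048` per slot. Second, each element takes 2 bytes, because the `KV self size` line reports f16 K and V caches. V uses `n_embd_v_gqa = 1024` in the same way that K uses `n_embd_k_gqa`:

$$
\begin{aligned}
n_{\text{ctx}} &= n_{\text{parallel}} \times n_{\text{ctx\_slot}} = 4 \times 2048 = 8192 \\
n_{\text{embd\_k\_gqa}} &= n_{\text{embd\_head\_k}} \times n_{\text{head\_kv}} = 128 \times 8 = 1024 \\
\text{K} &= n_{\text{ctx}} \times n_{\text{layer}} \times n_{\text{embd\_k\_gqa}} \times 2\ \text{bytes} = 8192 \times 32 \times 1024 \times 2 = 536\,870\,912\ \text{bytes} = 512\ \text{MiB} \\
\text{KV} &= \text{K} + \text{V} = 512\ \text{MiB} + 512\ \text{MiB} = 1024\ \text{MiB}
\end{aligned}
$$

This matches `CUDA0 KV buffer size = 1024.00 MiB` and the scheduler's `memory.required.kv="1.0 GiB"` estimate.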

Details about my config:
Model: NVIDIA Jetson AGX Orin Developer Kit - Jetpack 6.0 [L4T 36.3.0]
NV Power Mode[2]: MODE_30W
Hardware:

P-Number: p3701-0005
Module: NVIDIA Jetson AGX Orin (64GB ram)

Platform:

Distribution: Ubuntu 22.04 Jammy Jellyfish
Release: 5.15.136-tegra

Libraries:

CUDA: 12.2.140
cuDNN: 8.9.4.25
TensorRT: 8.6.2.3
VPI: 3.1.5
Vulkan: 1.3.204
OpenCV: 4.8.0 - with CUDA: NO

@soulisalmed commented on GitHub (Sep 13, 2024):

@wilbert-vb I managed to get `ollama run llama3.1` to load the model and generate output.

Before compiling ollama using https://github.com/ollama/ollama/issues/3406#issuecomment-2028118618, I had tried to install it using `curl -fsSL https://ollama.com/install.sh | sh`. The normal installation process copies some generic libraries into `/usr/local/lib/ollama`:

```
user@ubuntu:/usr/local/lib/ollama$ ls -lah
total 872M
drwxr-xr-x 2 utilisateur utilisateur 4.0K Sep 8 10:14 .
drwxr-xr-x 5 utilisateur utilisateur 4.0K Sep 13 09:03 ..
lrwxrwxrwx 1 utilisateur utilisateur 17 Feb 28 2024 libcublasLt.so -> libcublasLt.so.12
lrwxrwxrwx 1 utilisateur utilisateur 25 May 4 2021 libcublasLt.so.11 -> libcublasLt.so.11.5.1.109
-rwxr-xr-x 1 utilisateur utilisateur 235M May 4 2021 libcublasLt.so.11.5.1.109
lrwxrwxrwx 1 utilisateur utilisateur 26 Feb 28 2024 libcublasLt.so.12 -> ./libcublasLt.so.12.4.2.65
-rwxr-xr-x 1 utilisateur utilisateur 406M Feb 28 2024 libcublasLt.so.12.4.2.65
lrwxrwxrwx 1 utilisateur utilisateur 15 Feb 28 2024 libcublas.so -> libcublas.so.12
lrwxrwxrwx 1 utilisateur utilisateur 23 May 4 2021 libcublas.so.11 -> libcublas.so.11.5.1.109
-rwxr-xr-x 1 utilisateur utilisateur 121M May 4 2021 libcublas.so.11.5.1.109
lrwxrwxrwx 1 utilisateur utilisateur 24 Feb 28 2024 libcublas.so.12 -> ./libcublas.so.12.4.2.65
-rwxr-xr-x 1 utilisateur utilisateur 111M Feb 28 2024 libcublas.so.12.4.2.65
lrwxrwxrwx 1 utilisateur utilisateur 15 Feb 28 2024 libcudart.so -> libcudart.so.12
lrwxrwxrwx 1 utilisateur utilisateur 21 May 4 2021 libcudart.so.11.0 -> libcudart.so.11.3.109
-rwxr-xr-x 1 utilisateur utilisateur 624K May 4 2021 libcudart.so.11.3.109
lrwxrwxrwx 1 utilisateur utilisateur 20 Feb 28 2024 libcudart.so.12 -> libcudart.so.12.4.99
-rwxr-xr-x 1 utilisateur utilisateur 680K Feb 28 2024 libcudart.so.12.4.99
```

These are not compatible with the JetPack 6 versions of CUDA, cuBLAS, etc.
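To spot this kind of mismatch without reading `ls` output by eye, here is a minimal Python sketch. The helper names `cudart_versions` and `mismatched_cudart` and the script name are hypothetical and not part of ollama. The sketch assumes that the first two version components of a real `libcudart.so.X.Y.Z` file identify the CUDA release it belongs to. That holds for the files in this thread: 11.3.109, 12.4.99, and the system's 12.2.140. cuBLAS uses its own numbering; for example, `libcublas.so.11.5.1.109` sits next to `libcudart.so.11.3.109` in the listing. For that reason only libcudart is checked. The default expected release, `12.2`, is an assumption taken from the CUDA version reported for JetPack 6.0 in this thread.

```python
import re
import sys
from pathlib import Path

# Real CUDA runtime files look like libcudart.so.<major>.<minor>.<patch>,
# e.g. libcudart.so.12.4.99; short sonames such as libcudart.so.12 do not match.
CUDART_RE = re.compile(r"^libcudart\.so\.(\d+)\.(\d+)\.\d+$")


def cudart_versions(directory):
    """Map each versioned libcudart file in `directory` to its "major.minor" CUDA release."""
    found = {}
    for path in Path(directory).iterdir():
        match = CUDART_RE.match(path.name)
        # Skip symlinks so each library is reported once, under its real file name.
        if match and not path.is_symlink():
            found[path.name] = f"{match.group(1)}.{match.group(2)}"
    return found


def mismatched_cudart(directory, expected="12.2"):
    """Return the sorted libcudart file names whose CUDA release differs from `expected`."""
    versions = cudart_versions(directory)
    return sorted(name for name, version in versions.items() if version != expected)


if __name__ == "__main__":
    # Assumed defaults: the install-script library directory and JetPack 6's CUDA 12.2.
    target = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("/usr/local/lib/ollama")
    expected = sys.argv[2] if len(sys.argv) > 2 else "12.2"
    if not target.is_dir():
        print(f"{target}: not present")
    else:
        versions = cudart_versions(target)
        for name in mismatched_cudart(target, expected):
            print(f"{target / name}: CUDA {versions[name]}, expected {expected}")
```

You could run it as, for example, `python3 check_cudart.py /usr/local/lib/ollama 12.2`. Against the listing above, it would flag `libcudart.so.11.3.109` (CUDA 11.3) and `libcudart.so.12.4.99` (CUDA 12.4).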
It seems the compiled ollama binary loads those in priority over the ones in `LD_LIBRARY_PATH=/usr/local/cuda/lib:/usr/local/cuda/lib64/:/usr/local/cuda/include`. After renaming or deleting that directory, it works now:

```
sudo mv /usr/local/lib/ollama /usr/local/lib/ollama_stock
```

or

```
sudo rm -r /usr/local/lib/ollama
```

<details>
<summary>Logs</summary>

```
Sep 13 09:18:56 ubuntu systemd[1]: Started Ollama Service. Sep 13 09:18:56 ubuntu ollama[17505]: 2024/09/13 09:18:56 routes.go:1151: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]" Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.859+02:00 level=INFO source=images.go:753 msg="total blobs: 5" Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.859+02:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0" Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached. Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
Sep 13 09:18:56 ubuntu ollama[17505]: - using env: export GIN_MODE=release
Sep 13 09:18:56 ubuntu ollama[17505]: - using code: gin.SetMode(gin.ReleaseMode)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST /api/embed --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] POST /v1/embeddings --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: [GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.860+02:00 level=INFO source=routes.go:1198 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.860+02:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1163128731/runners
Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.860+02:00 level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v12 file=build/linux/arm64/cuda_v12/bin/libggml.so.gz
Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.860+02:00 level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v12 file=build/linux/arm64/cuda_v12/bin/libllama.so.gz
Sep 13 09:18:56 ubuntu ollama[17505]: time=2024-09-13T09:18:56.860+02:00 level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v12 file=build/linux/arm64/cuda_v12/bin/ollama_llama_server.gz
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1163128731/runners/cuda_v12/ollama_llama_server
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cuda_v12]"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=INFO source=gpu.go:200 msg="looking for compatible GPUs"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=WARN source=gpu.go:669 msg="unable to locate gpu dependency libraries"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.284+02:00 level=DEBUG source=gpu.go:86 msg="searching for GPU discovery libraries for NVIDIA"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.285+02:00 level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=libcuda.so*
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.285+02:00 level=WARN source=gpu.go:669 msg="unable to locate gpu dependency libraries"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.285+02:00 level=DEBUG source=gpu.go:491 msg="gpu library search" globs="[libcuda.so* /libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.291+02:00 level=DEBUG source=gpu.go:525 msg="discovered GPU libraries" paths=[/usr/lib/aarch64-linux-gnu/nvidia/libcuda.so.1.1]
Sep 13 09:19:01 ubuntu ollama[17505]: CUDA driver version: 12.2
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.424+02:00 level=DEBUG source=gpu.go:119 msg="detected GPUs" count=1 library=/usr/lib/aarch64-linux-gnu/nvidia/libcuda.so.1.1
Sep 13 09:19:01 ubuntu ollama[17505]: [GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce] CUDA totalMem 62841 mb
Sep 13 09:19:01 ubuntu ollama[17505]: [GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce] CUDA freeMem 52792 mb
Sep 13 09:19:01 ubuntu ollama[17505]: [GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce] Compute Capability 8.7
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.543+02:00 level=DEBUG source=amd_linux.go:376 msg="amdgpu driver not detected /sys/module/amdgpu"
Sep 13 09:19:01 ubuntu ollama[17505]: releasing cuda driver library
Sep 13 09:19:01 ubuntu ollama[17505]: time=2024-09-13T09:19:01.543+02:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce library=cuda variant=jetpack6 compute=8.7 driver=12.2 name=Orin total="61.4 GiB" available="51.6 GiB"
Sep 13 09:19:05 ubuntu ollama[17505]: [GIN] 2024/09/13 - 09:19:05 | 200 | 104.503µs | 127.0.0.1 | HEAD "/"
Sep 13 09:19:05 ubuntu ollama[17505]: [GIN] 2024/09/13 - 09:19:05 | 200 | 30.521045ms | 127.0.0.1 | POST "/api/show"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.309+02:00 level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="61.4 GiB" before.free="51.8 GiB" before.free_swap="30.7 GiB" now.total="61.4 GiB" now.free="51.8 GiB" now.free_swap="30.7 GiB"
Sep 13 09:19:05 ubuntu ollama[17505]: CUDA driver version: 12.2
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.559+02:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce name=Orin overhead="0 B" before.total="61.4 GiB" before.free="51.6 GiB" now.total="61.4 GiB" now.free="51.5 GiB" now.used="9.8 GiB"
Sep 13 09:19:05 ubuntu ollama[17505]: releasing cuda driver library
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.560+02:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x7e6d00 gpu_count=1
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.657+02:00 level=DEBUG source=sched.go:224 msg="loading first model" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.658+02:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[51.5 GiB]"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.660+02:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce parallel=4 available=55335387136 required="6.2 GiB"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.661+02:00 level=INFO source=server.go:101 msg="system memory" total="61.4 GiB" free="51.8 GiB" free_swap="30.7 GiB"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.661+02:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[51.5 GiB]"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.662+02:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[51.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.663+02:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1163128731/runners/cuda_v12/ollama_llama_server
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.663+02:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1163128731/runners/cuda_v12/ollama_llama_server
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.668+02:00 level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama1163128731/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --verbose --parallel 4 --port 44895"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.668+02:00 level=DEBUG source=server.go:408 msg=subprocess environment="[PATH=/home/utilisateur/go/bin:/usr/local/go/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin LD_LIBRARY_PATH=/tmp/ollama1163128731/runners/cuda_v12 CUDA_VISIBLE_DEVICES=GPU-5fd13bbd-ef2f-5985-98ff-88638f51c2ce]"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.670+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.670+02:00 level=INFO source=server.go:590 msg="waiting for llama runner to start responding"
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.672+02:00 level=INFO source=server.go:624 msg="waiting for server to become available" status="llm server error"
Sep 13 09:19:05 ubuntu ollama[17533]: INFO [main] build info | build=3661 commit="8962422b" tid="281472762038336" timestamp=1726211945
Sep 13 09:19:05 ubuntu ollama[17533]: INFO [main] system info | n_threads=8 n_threads_batch=8 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="281472762038336" timestamp=1726211945 total_threads=8
Sep 13 09:19:05 ubuntu ollama[17533]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="7" port="44895" tid="281472762038336" timestamp=1726211945
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest))
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 0: general.architecture str = llama
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 1: general.type str = model
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 3: general.finetune str = Instruct
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 5: general.size_label str = 8B
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 6: general.license str = llama3.1
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 9: llama.block_count u32 = 32
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 10: llama.context_length u32 = 131072
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 17: general.file_type u32 = 2
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - kv 28: general.quantization_version u32 = 2
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - type f32: 66 tensors
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - type q4_0: 225 tensors
Sep 13 09:19:05 ubuntu ollama[17505]: llama_model_loader: - type q6_K: 1 tensors
Sep 13 09:19:05 ubuntu ollama[17505]: time=2024-09-13T09:19:05.924+02:00 level=INFO source=server.go:624 msg="waiting for server to become available" status="llm server loading model"
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_vocab: special tokens cache size = 256
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_vocab: token to piece cache size = 0.7999 MB
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: format = GGUF V3 (latest)
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: arch = llama
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: vocab type = BPE
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_vocab = 128256
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_merges = 280147
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: vocab_only = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_ctx_train = 131072
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_embd = 4096
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_layer = 32
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_head = 32
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_head_kv = 8
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_rot = 128
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_swa = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_embd_head_k = 128
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_embd_head_v = 128
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_gqa = 4
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_embd_k_gqa = 1024
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_embd_v_gqa = 1024
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: f_norm_eps = 0.0e+00
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: f_logit_scale = 0.0e+00
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_ff = 14336
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_expert = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_expert_used = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: causal attn = 1
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: pooling type = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: rope type = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: rope scaling = linear
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: freq_base_train = 500000.0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: freq_scale_train = 1
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: n_ctx_orig_yarn = 131072
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: rope_finetuned = unknown
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: ssm_d_conv = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: ssm_d_inner = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: ssm_d_state = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: ssm_dt_rank = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: ssm_dt_b_c_rms = 0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: model type = 8B
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: model ftype = Q4_0
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: model params = 8.03 B
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: LF token = 128 'Ä'
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_print_meta: max token length = 256
Sep 13 09:19:06 ubuntu ollama[17505]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Sep 13 09:19:06 ubuntu ollama[17505]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 13 09:19:06 ubuntu ollama[17505]: ggml_cuda_init: found 1 CUDA devices:
Sep 13 09:19:06 ubuntu ollama[17505]: Device 0: Orin, compute capability 8.7, VMM: yes
Sep 13 09:19:06 ubuntu ollama[17505]: llm_load_tensors: ggml ctx size = 0.27 MiB
Sep 13 09:19:07 ubuntu ollama[17505]: llm_load_tensors: offloading 32 repeating layers to GPU
Sep 13 09:19:07 ubuntu ollama[17505]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 13 09:19:07 ubuntu ollama[17505]: llm_load_tensors: offloaded 33/33 layers to GPU
Sep 13 09:19:07 ubuntu ollama[17505]: llm_load_tensors: CPU buffer size = 281.81 MiB
Sep 13 09:19:07 ubuntu ollama[17505]: llm_load_tensors: CUDA0 buffer size = 4156.00 MiB
Sep 13 09:19:07 ubuntu ollama[17505]: time=2024-09-13T09:19:07.182+02:00 level=DEBUG source=server.go:635 msg="model load progress 0.06"
Sep 13 09:19:07 ubuntu ollama[17505]: time=2024-09-13T09:19:07.433+02:00 level=DEBUG source=server.go:635 msg="model load progress 0.44"
Sep 13 09:19:07 ubuntu ollama[17505]: time=2024-09-13T09:19:07.684+02:00 level=DEBUG source=server.go:635 msg="model load progress 0.76"
Sep 13 09:19:07 ubuntu ollama[17505]: time=2024-09-13T09:19:07.936+02:00 level=DEBUG source=server.go:635 msg="model load progress 0.99"
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: n_ctx = 8192
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: n_batch = 512
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: n_ubatch = 512
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: flash_attn = 0
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: freq_base = 500000.0
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: freq_scale = 1
Sep 13 09:19:08 ubuntu ollama[17505]: llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: CUDA_Host output buffer size = 2.02 MiB
Sep 13 09:19:08 ubuntu ollama[17505]: time=2024-09-13T09:19:08.187+02:00 level=DEBUG source=server.go:635 msg="model load progress 1.00"
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: CUDA0 compute buffer size = 560.00 MiB
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: CUDA_Host compute buffer size = 24.01 MiB
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: graph nodes = 1030
Sep 13 09:19:08 ubuntu ollama[17505]: llama_new_context_with_model: graph splits = 2
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [initialize] initializing slots | n_slots=4 tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=0 tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=1 tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=2 tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=3 tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: INFO [main] model loaded | tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17533]: DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=0 tid="281472762038336" timestamp=1726211948
Sep 13 09:19:08 ubuntu ollama[17505]: time=2024-09-13T09:19:08.450+02:00 level=INFO source=server.go:629 msg="llama runner started in 2.78 seconds"
Sep 13 09:19:08 ubuntu ollama[17505]: time=2024-09-13T09:19:08.450+02:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
Sep 13 09:19:08 ubuntu ollama[17505]: [GIN] 2024/09/13 - 09:19:08 | 200 | 3.172840134s | 127.0.0.1 | POST "/api/generate"
Sep 13 09:19:08 ubuntu ollama[17505]: time=2024-09-13T09:19:08.451+02:00 level=DEBUG source=sched.go:466 msg="context for request finished"
Sep 13 09:19:08 ubuntu ollama[17505]: time=2024-09-13T09:19:08.451+02:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe duration=5m0s
Sep 13 09:19:08 ubuntu ollama[17505]: time=2024-09-13T09:19:08.451+02:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe refCount=0
Sep 13 09:19:14 ubuntu ollama[17505]: time=2024-09-13T09:19:14.899+02:00 level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
Sep 13 09:19:14 ubuntu ollama[17533]: DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1 tid="281472762038336" timestamp=1726211954
Sep 13 09:19:14 ubuntu ollama[17505]: time=2024-09-13T09:19:14.906+02:00 level=DEBUG source=routes.go:1415 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
Sep 13 09:19:14 ubuntu ollama[17533]: DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=2 tid="281472762038336" timestamp=1726211954
Sep 13 09:19:14 ubuntu ollama[17533]: DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=3 tid="281472762038336" timestamp=1726211954
Sep 13 09:19:14 ubuntu ollama[17533]: DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=11 slot_id=0 task_id=3 tid="281472762038336" timestamp=1726211954
Sep 13 09:19:14 ubuntu ollama[17533]: DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=3 tid="281472762038336" timestamp=1726211954
Sep 13 09:19:16 ubuntu ollama[17533]: DEBUG [print_timings] prompt eval time = 385.58 ms / 11 tokens ( 35.05 ms per token, 28.53 tokens per second) | n_prompt_tokens_processed=11 n_tokens_second=28.528450645780385 slot_id=0 t_prompt_processing=385.58 t_token=35.052727272727275 task_id=3 tid="281472762038336" timestamp=1726211956
Sep 13 09:19:16 ubuntu ollama[17533]: DEBUG [print_timings] generation eval time = 947.29 ms / 10 runs ( 94.73 ms per token, 10.56 tokens per second) | n_decoded=10 n_tokens_second=10.55638481822961 slot_id=0 t_token=94.7294 t_token_generation=947.294 task_id=3 tid="281472762038336" timestamp=1726211956
Sep 13 09:19:16 ubuntu ollama[17533]: DEBUG [print_timings] total time = 1332.87 ms | slot_id=0 t_prompt_processing=385.58 t_token_generation=947.294 t_total=1332.874 task_id=3 tid="281472762038336" timestamp=1726211956
Sep 13 09:19:16 ubuntu ollama[17533]: DEBUG [update_slots] slot released | n_cache_tokens=21 n_ctx=8192 n_past=20 n_system_tokens=0 slot_id=0 task_id=3 tid="281472762038336" timestamp=1726211956 truncated=false
Sep 13 09:19:16 ubuntu ollama[17533]: DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=46016 status=200 tid="281472054396992" timestamp=1726211956
Sep 13 09:19:16 ubuntu ollama[17505]: [GIN] 2024/09/13 - 09:19:16 | 200 | 1.460359687s | 127.0.0.1 | POST "/api/chat"
Sep 13 09:19:16 ubuntu ollama[17505]: time=2024-09-13T09:19:16.287+02:00 level=DEBUG source=sched.go:407 msg="context for request finished"
Sep 13 09:19:16 ubuntu ollama[17505]: time=2024-09-13T09:19:16.288+02:00 level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe duration=5m0s
Sep 13 09:19:16 ubuntu ollama[17505]: time=2024-09-13T09:19:16.288+02:00 level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe refCount=0
```
</details>

<ins>Details about my config:</ins>

**Model**: NVIDIA Jetson AGX Orin Developer Kit - JetPack 6.0 [L4T 36.3.0]

**NV Power Mode[2]**: MODE_30W

**Hardware**:
- P-Number: p3701-0005
- Module: NVIDIA Jetson AGX Orin (64 GB RAM)

**Platform**:
- Distribution: Ubuntu 22.04 Jammy Jellyfish
- Release: 5.15.136-tegra

**Libraries**:
- CUDA: 12.2.140
- cuDNN: 8.9.4.25
- TensorRT: 8.6.2.3
- VPI: 3.1.5
- Vulkan: 1.3.204
- OpenCV: 4.8.0 - with CUDA: NO

GiteaMirror commented

2026-04-28 08:56:30 -05:00

@remy415 commented on GitHub (Sep 13, 2024):

@soulisalmed thank you, yes, that has been an ongoing issue: generic CUDA graphics drivers that work on most systems don’t play well with the Jetson drivers. I would speculate that NVIDIA is working on getting the Jetson firmware to a state where it works with their common drivers for this very reason. @wilbert-vb, let us know if the fix of removing/renaming the install script's driver directory works for you.

Alternatively, you can try the container approach on dustynv’s GitHub page.
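A minimal Python sketch of the rename workaround follows. It assumes the install script put its bundled libraries in `/usr/local/lib/ollama` (the directory renamed in the reply below) and that renaming is preferable to deleting so the change can be undone; the function names, the `.bak` suffix and the `--apply` flag are hypothetical. Stop the service first (`sudo systemctl stop ollama`), run the script as root, then start the service again (`sudo systemctl start ollama`).

```python
import sys
from pathlib import Path
from typing import Optional

# Assumption: the install script's bundled libraries live here (the directory renamed in this thread).
DEFAULT_LIB_DIR = Path("/usr/local/lib/ollama")


def disable_bundled_libs(lib_dir: Path = DEFAULT_LIB_DIR, suffix: str = ".bak") -> Optional[Path]:
    """Rename lib_dir to <name><suffix> so its bundled libraries are out of the way.

    Returns the new path, or None if lib_dir does not exist.
    """
    lib_dir = Path(lib_dir)
    if not lib_dir.is_dir():
        return None
    backup = lib_dir.with_name(lib_dir.name + suffix)
    if backup.exists():
        # Never clobber an earlier backup.
        raise FileExistsError(f"{backup} already exists")
    lib_dir.rename(backup)
    return backup


def restore_bundled_libs(lib_dir: Path = DEFAULT_LIB_DIR, suffix: str = ".bak") -> bool:
    """Undo disable_bundled_libs(); returns True if the directory was moved back."""
    lib_dir = Path(lib_dir)
    backup = lib_dir.with_name(lib_dir.name + suffix)
    if lib_dir.exists() or not backup.is_dir():
        return False
    backup.rename(lib_dir)
    return True


if __name__ == "__main__":
    # Dry run by default; pass the hypothetical --apply flag to actually rename.
    if "--apply" in sys.argv[1:]:
        moved = disable_bundled_libs()
        print(f"renamed to {moved}" if moved else f"{DEFAULT_LIB_DIR} not found, nothing to do")
    else:
        print(f"dry run: would rename {DEFAULT_LIB_DIR} (pass --apply to do it)")
```

Keeping the renamed copy means a later Ollama upgrade or a regression can be checked quickly with `restore_bundled_libs()`.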


GiteaMirror commented

2026-04-28 08:56:31 -05:00

@wilbert-vb commented on GitHub (Sep 13, 2024):

> @soulisalmed thank you, yes, that has been an ongoing issue: generic CUDA graphics drivers that work on most systems don’t play well with the Jetson drivers. I would speculate that NVIDIA is working on getting the Jetson firmware to a state where it works with their common drivers for this very reason. @wilbert-vb, let us know if the fix of removing/renaming the install script's driver directory works for you.
>
> Alternatively, you can try the container approach on dustynv’s GitHub page.

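I can confirm that after renaming /usr/local/lib/ollama, Ollama loads a model and responds to prompts, and the GPU is utilized.

<img width="632" alt="Screenshot 2024-09-13 at 07 24 34" src="https://github.com/user-attachments/assets/48068785-9ecf-47fd-abfd-b2385cfe8f06">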
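Besides watching GPU utilization, the offload can be confirmed from the server log: the llama.cpp loader prints a line such as `llm_load_tensors: offloaded 33/33 layers to GPU` (it appears in the journal excerpt earlier in this thread). Below is a minimal Python sketch that extracts the most recent such line from `journalctl -u ollama`, assuming the message format stays as shown; the function names are hypothetical.

```python
import re
import subprocess
from typing import Optional, Tuple

# Matches loader lines like "llm_load_tensors: offloaded 33/33 layers to GPU"
# (format assumed from the journal excerpt in this thread; it may change between versions).
OFFLOAD_RE = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")


def last_offload(log_text: str) -> Optional[Tuple[int, int]]:
    """Return (offloaded, total) from the most recent offload line, or None if there is none."""
    matches = OFFLOAD_RE.findall(log_text)
    if not matches:
        return None
    offloaded, total = matches[-1]
    return int(offloaded), int(total)


def describe(result: Optional[Tuple[int, int]]) -> str:
    """Turn last_offload()'s result into a one-line summary."""
    if result is None:
        return "no offload line found (has a model been loaded since the service started?)"
    offloaded, total = result
    # Layers that are not offloaded stay on the CPU.
    where = "fully on GPU" if offloaded == total else "partly on CPU"
    return f"offloaded {offloaded}/{total} layers: {where}"


def read_journal(unit: str = "ollama") -> str:
    """Fetch the systemd journal for the service (requires journalctl)."""
    proc = subprocess.run(
        ["journalctl", "-u", unit, "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return proc.stdout
```

On the device, `print(describe(last_offload(read_journal())))` would print `offloaded 33/33 layers: fully on GPU` for a journal like the one above (reading the system journal may require sudo).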

Dustynv's container seems out of date, and he is less motivated to keep up with Ollama's progress. He even suggested using llama.cpp (see https://github.com/dusty-nv/jetson-containers/issues/585#issuecomment-2316016480).

Many thanks, @soulisalmed and team!



brucemacd/qwen3vl

jmorganca/readme-simplify

parth/gpt-oss-structured-outputs

revert-12039-jmorganca/tools-braces

mxyng/embeddings

mxyng/gguf

mxyng/benchmark

mxyng/types-null

parth/move-parsing

mxyng/gemma2

jmorganca/docs

mxyng/16-bit

mxyng/create-stdin

pdevine/authorizedkeys

mxyng/quant

parth/opt-in-error-context-window

brucemacd/cache-models

brucemacd/runner-completion

jmorganca/llama-update-6

brucemacd/benchmark-list

brucemacd/partial-read-caps

parth/deepseek-r1-tools

mxyng/omit-array

parth/tool-prefix-temp

brucemacd/runner-test

jmorganca/qwen25vl

brucemacd/model-forward-test-ext

parth/python-function-parsing

jmorganca/cuda-compression-none

drifkin/num-parallel

drifkin/chat-truncation-fix

jmorganca/sync

parth/python-tools-calling

drifkin/array-head-count

brucemacd/create-no-loop

parth/server-enable-content-stream-with-tools

qwen25omni

mxyng/v3

brucemacd/ropeconfig

jmorganca/silence-tokenizer

parth/sample-so-test

parth/sampling-structured-outputs

brucemacd/doc-go-engine

parth/constrained-sampling-json

jmorganca/mistral-wip

brucemacd/mistral-small-convert

parth/sample-unmarshal-json-for-params

brucemacd/jomorganca/mistral

pdevine/bfloat16

jmorganca/mistral

brucemacd/mistral

pdevine/logging

parth/sample-correctness-fix

parth/sample-fix-sorting

jmorgan/sample-fix-sorting-extras

jmorganca/temp-0-images

brucemacd/parallel-embed-models

brucemacd/shim-grammar

jmorganca/fix-gguf-error

bmizerany/nameswork

jmorganca/faster-releases

bmizerany/validatenames

brucemacd/err-no-vocab

brucemacd/rope-config

brucemacd/err-hint

brucemacd/qwen2_5

brucemacd/logprobs

brucemacd/new_runner_graph_bench

progress-flicker

brucemacd/forward-test

brucemacd/go_qwen2

pdevine/gemma2

jmorganca/add-missing-symlink-eval

mxyng/next-debug

parth/set-context-size-openai

brucemacd/next-bpe-bench

brucemacd/next-bpe-test

brucemacd/new_runner_e2e

brucemacd/new_runner_qwen2

pdevine/convert-cohere2

brucemacd/convert-cli

parth/log-probs

mxyng/next-mlx

mxyng/cmd-history

parth/templating

parth/tokenize-detokenize

brucemacd/check-key-register

bmizerany/grammar

jmorganca/vendor-081b29bd

mxyng/func-checks

jmorganca/fix-null-format

parth/fix-default-to-warn-json

jmorganca/qwen2vl

jmorganca/no-concat

parth/cmd-cleanup-SO

brucemacd/check-key-register-structured-err

parth/openai-stream-usage

parth/fix-referencing-so

stream-tools-stop

jmorganca/degin-1

brucemacd/install-path-clean

brucemacd/push-name-validation

brucemacd/browser-key-register

jmorganca/openai-fix-first-message

jmorganca/fix-proxy

jessegross/sample

parth/disallow-streaming-tools

dhiltgen/remove_submodule

jmorganca/ga

jmorganca/mllama

pdevine/newlines

pdevine/geems-2b

jmorganca/llama-bump

mxyng/modelname-7

mxyng/gin-slog

mxyng/modelname-6

jyan/convert-prog

jyan/quant5

paligemma-support

pdevine/import-docs

jmorganca/openai-context

jyan/paligemma

jyan/p2

jyan/palitest

bmizerany/embedspeedup

jmorganca/llama-vit

brucemacd/allow-ollama

royh/ep-methods

royh/whisper

mxyng/api-models

mxyng/fix-memory

jyan/q4_4/8

jyan/ollama-v

royh/stream-tools

roy-embed-parallel

bmizerany/hrm

revert-5963-revert-5924-mxyng/llama3.1-rope

royh/embed-viz

jyan/local2

jyan/auth

jyan/local

jyan/parse-temp

jmorganca/template-mistral

jyan/reord-g

royh-openai-suffixdocs

royh-imgembed

royh-embed-parallel

jyan/quant4

royh-precision

jyan/progress

pdevine/fix-template

jyan/quant3

pdevine/ggla

mxyng/update-registry-domain

jmorganca/ggml-static

mxyng/create-context

jyan/v0.146

mxyng/layers-from-files

build_dist

bmizerany/noseek

royh-ls

royh-name

timeout

mxyng/server-timestamp

bmizerany/nosillyggufslurps

royh-params

jmorganca/llama-cpp-7c26775

royh-openai-delete

royh-show-rigid

jmorganca/enable-fa

jmorganca/no-error-template

jyan/format

royh-testdelete

bmizerany/fastverify

language_support

pdevine/ps-glitches

brucemacd/tokenize

bruce/iq-quants

bmizerany/filepathwithcoloninhost

mxyng/split-bin

bmizerany/client-registry

jmorganca/if-none-match

native

jmorganca/native

jmorganca/batch-embeddings

jmorganca/initcmake

jmorganca/mm

pdevine/showggmlinfo

modenameenforcealphanum

bmizerany/modenameenforcealphanum

jmorganca/done-reason

jmorganca/llama-cpp-8960fe8

ollama.com

bmizerany/filepathnobuild

bmizerany/types/model/defaultfix

rmdisplaylong

nogogen

bmizerany/x

modelfile-readme

bmizerany/replacecolon

jmorganca/limit

jmorganca/execstack

jmorganca/replace-assets

mxyng/tune-concurrency

jmorganca/testing

whitespace-detection

jmorganca/options

upgrade-all

scratch

cuda-search

mattw/airenamer

mattw/allmodelsonhuggingface

mattw/quantcontext

mattw/whatneedstorun

brucemacd/llama-mem-calc

mattw/faq-context

mattw/communitylinks

mattw/noprune

mattw/python-functioncalling

rename

mxyng/install

pulse

remove-first

editor

mattw/selfqueryingretrieval

cgo

mattw/howtoquant

api

matt/streamingapi

format-config

mxyng/extra-args

shell

update-nous-hermes

cp-model

upload-progress

fix-unknown-model

fix-model-names

delete-fix

insecure-registry

ls

deletemodels

progressbar

readme-updates

license-layers

skip-list

list-models

modelpath

matt/examplemodelfiles

distribution

go-opts

Reference: github-starred/ollama#48608