[GH-ISSUE #3930] GPU allocation lost after container idle period #64473

Closed
opened 2026-05-03 17:47:37 -05:00 by GiteaMirror · 14 comments
Owner

Originally created by @hl-hok on GitHub (Apr 26, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3930

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I'm experiencing an issue with Ollama where the Docker container fails to utilize the GPU unless I restart the container. This occurs when the container remains idle for an extended period (e.g., a day).

Initially, the GPU is configured correctly and allocated to the container. However, after not using the LLM for a while, the container only utilizes the CPU and ignores the available GPU resources.

Restarting the Docker container resolves the issue, and the GPU is allocated again. I've verified that my GPU configuration is correct, and the Ollama service is running normally.

Steps to reproduce:

  1. Run an LLM using Ollama in a Docker container with a correctly configured GPU (a sample launch command follows this list).
  2. Allow the container to remain idle for an extended period (e.g., a day).
  3. Attempt to use the LLM again.
  4. Observe that the container only utilizes the CPU and not the GPU.
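
For concreteness, a typical GPU-enabled launch looks like the sketch below (the container name, port mapping, and volume are illustrative rather than taken from this report, and the host is assumed to have the NVIDIA Container Toolkit installed):

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama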

Expected behavior:
The Docker container should continue to utilize the allocated GPU resources even after an extended idle period.

Environment:
Ollama version: 0.1.32
Docker version: 26.0.2
GPU driver version: CUDA 12.4
Kernel version: 6.5.0-27-generic

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.1.32

GiteaMirror added the docker, bug, nvidia labels 2026-05-03 17:47:38 -05:00
Author
Owner

@gaye746560359 commented on GitHub (Apr 29, 2024):

me too

Author
Owner

@dhiltgen commented on GitHub (May 1, 2024):

Can you share the server log after the idle period so we can see why it's failing to discover the GPU? It may be helpful to set `-e OLLAMA_DEBUG=1` on the container.
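
For example (a sketch reusing the stock launch command with debug logging enabled; the container name is illustrative):

docker rm -f ollama
docker run -d --gpus=all -e OLLAMA_DEBUG=1 -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker logs -f ollama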

Author
Owner

@hl-hok commented on GitHub (May 2, 2024):

Hi @dhiltgen, I've just run a new container and removed the old one. I restarted the container once again with the debug setting. I'll share the log here once it fails again.

Author
Owner

@yukaichao commented on GitHub (May 8, 2024):

https://blog.csdn.net/flipped_1121/article/details/137047698
You can try this.

Author
Owner

@sammcj commented on GitHub (May 8, 2024):

I've noticed this as well; I end up restarting my container every 8 hours or so to ensure it doesn't end up using the CPU.

I'll do the same and enable debug on my setup to try and catch it as well.
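
In the meantime, the periodic-restart workaround can be automated with cron (a sketch; it assumes the container is named ollama):

# crontab entry: restart the Ollama container every 8 hours
0 */8 * * * docker restart ollama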

Author
Owner

@brivad commented on GitHub (May 9, 2024):

I'm experiencing a similar issue to the one described in this thread, but my setup differs slightly as I'm not using Docker. I have Ollama version 0.1.32 installed via apt on an Ubuntu 22 system and running as a service (ollama.service). Unfortunately, neither restarting Ollama nor the Nvidia stack resolves the issue. The only workaround I've found is to restart the entire system to get the GPU detected again.

Author
Owner

@dhiltgen commented on GitHub (May 10, 2024):

We've recently added some troubleshooting steps for nvidia drivers https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#container-fails-to-run-on-nvidia-gpu which might be helpful in narrowing down your problem @brivad. If none of those solve it, can you share your server log showing attempts to load the model where it doesn't run on the GPU? I'm expecting there will be errors reported from the nvidia APIs we call that might help us understand what's going wrong.
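
One commonly cited step from that doc is reloading the NVIDIA uvm kernel module on the host, which some users report restores GPU discovery without a full system reboot (a sketch, not a guaranteed fix; verify against the linked doc and run it on the host, not inside the container):

sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm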

Author
Owner

@davmacario commented on GitHub (Oct 11, 2024):

I have the exact same issue

Author
Owner

@gmacario commented on GitHub (Oct 11, 2024):

> @dhiltgen closed this as completed on May 31

@dhiltgen just wondering whether this issue was closed because of some fixes, or is the workaround suggested by someone in this thread (i.e. restarting the container every 8h or so) the way to go.

I am also experiencing intermittent freezes of the latest release of the ollama/ollama image running on an up-to-date Ubuntu 24.04.1 LTS system:

gmacario@hw2482:~$ docker ps | grep ollama
01ca3aebb080   ollama/ollama                                    "/bin/ollama serve"      11 days ago   Up 5 hours             0.0.0.0:11435->11434/tcp, [::]:11435->11434/tcp   ollama
gmacario@hw2482:~$ docker pull ollama/ollama
Using default tag: latest
latest: Pulling from ollama/ollama
Digest: sha256:e458178cf2c114a22e1fe954dd9a92c785d1be686578a6c073a60cf259875470
Status: Image is up to date for ollama/ollama:latest
docker.io/ollama/ollama:latest
gmacario@hw2482:~$ docker --version
Docker version 27.3.1, build ce12230
gmacario@hw2482:~$ uname -a
Linux hw2482 6.8.0-45-generic #45-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 30 12:02:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
gmacario@hw2482:~$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
gmacario@hw2482:~$

I am happy to provide more details about my HW/SW configuration if it helps reopen this issue.

Author
Owner

@andi20002000 commented on GitHub (Oct 14, 2024):

I also have the same issue, also on Ubuntu 24. I can post all system info if the issue gets reopened.

Author
Owner

@dhiltgen commented on GitHub (Oct 14, 2024):

@davmacario @gmacario @andi20002000 if the troubleshooting steps didn't resolve the problem, please post server logs so we can see what's going wrong during GPU discovery.

Author
Owner

@recrudesce commented on GitHub (Oct 14, 2024):

I too have this issue. Ollama just seems to lose the GPU if it's been idle for a certain amount of time. Restarting the container fixes the issue.

It's finding the GPU fine on first startup, but after several hours of idling it switches to using the CPU. I've just restarted my container with the OLLAMA_DEBUG env variable set, and I'll see if I can get some logs for diagnostics.

Author
Owner

@davmacario commented on GitHub (Oct 15, 2024):

Here are the logs. I removed a bunch of stuff that happened earlier (basically a request that correctly triggered the use of the GPU).

...
time=2024-10-15T11:55:03.503Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-70f3b7ab-94d4-9950-8a8b-e8f2c853f85d name="NVIDIA GeForce GTX 1080 Ti" overhead="0 B" before.total="10.9 GiB" before.free="10.7 GiB" now.total="10.9 GiB" now.free="3.1 GiB" now.used="7.8 GiB"
time=2024-10-15T11:55:03.635Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-a7422633-d0d3-1a33-798c-0d95dc91a450 name="NVIDIA GeForce GTX 1080 Ti" overhead="0 B" before.total="10.9 GiB" before.free="10.8 GiB" now.total="10.9 GiB" now.free="2.9 GiB" now.used="8.1 GiB"
releasing cuda driver library
time=2024-10-15T11:55:03.635Z level=DEBUG source=server.go:1097 msg="stopping llama server"
time=2024-10-15T11:55:03.635Z level=DEBUG source=server.go:1103 msg="waiting for llama server to exit"
time=2024-10-15T11:55:03.729Z level=DEBUG source=server.go:1107 msg="llama server stopped"
time=2024-10-15T11:55:03.729Z level=DEBUG source=sched.go:380 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e
time=2024-10-15T11:55:03.886Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.0 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.4 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
time=2024-10-15T11:55:04.043Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-70f3b7ab-94d4-9950-8a8b-e8f2c853f85d name="NVIDIA GeForce GTX 1080 Ti" overhead="0 B" before.total="10.9 GiB" before.free="3.1 GiB" now.total="10.9 GiB" now.free="10.7 GiB" now.used="220.6 MiB"
time=2024-10-15T11:55:04.146Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-a7422633-d0d3-1a33-798c-0d95dc91a450 name="NVIDIA GeForce GTX 1080 Ti" overhead="0 B" before.total="10.9 GiB" before.free="2.9 GiB" now.total="10.9 GiB" now.free="10.8 GiB" now.used="144.9 MiB"
releasing cuda driver library
time=2024-10-15T11:55:04.146Z level=DEBUG source=sched.go:659 msg="gpu VRAM free memory converged after 0.76 seconds" model=/root/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e
time=2024-10-15T11:55:04.146Z level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e
time=2024-10-15T11:55:04.146Z level=DEBUG source=sched.go:308 msg="ignoring unload event with no pending requests"
[GIN] 2024/10/15 - 16:55:14 | 200 |      34.568µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/10/15 - 16:55:14 | 200 |    47.78086ms |       127.0.0.1 | POST     "/api/show"
time=2024-10-15T16:55:14.477Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.4 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.4 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T16:55:14.496Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T16:55:14.516Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T16:55:14.580Z level=DEBUG source=sched.go:224 msg="loading first model" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
time=2024-10-15T16:55:14.580Z level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[10.8 GiB]"
time=2024-10-15T16:55:14.581Z level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-a7422633-d0d3-1a33-798c-0d95dc91a450 parallel=4 available=11562909696 required="6.2 GiB"
time=2024-10-15T16:55:14.581Z level=INFO source=server.go:108 msg="system memory" total="31.2 GiB" free="29.3 GiB" free_swap="8.0 GiB"
time=2024-10-15T16:55:14.581Z level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[10.8 GiB]"
time=2024-10-15T16:55:14.582Z level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[10.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-10-15T16:55:14.583Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu/ollama_llama_server
time=2024-10-15T16:55:14.583Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx/ollama_llama_server
time=2024-10-15T16:55:14.583Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx2/ollama_llama_server
time=2024-10-15T16:55:14.583Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v11/ollama_llama_server
time=2024-10-15T16:55:14.583Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v12/ollama_llama_server
time=2024-10-15T16:55:14.583Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu/ollama_llama_server
time=2024-10-15T16:55:14.583Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx/ollama_llama_server
time=2024-10-15T16:55:14.583Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx2/ollama_llama_server
time=2024-10-15T16:55:14.583Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v11/ollama_llama_server
time=2024-10-15T16:55:14.583Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v12/ollama_llama_server
time=2024-10-15T16:55:14.588Z level=INFO source=server.go:399 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12/ollama_llama_server --model /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --verbose --parallel 4 --port 44195"
time=2024-10-15T16:55:14.588Z level=DEBUG source=server.go:416 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/runners/cuda_v12:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 CUDA_VISIBLE_DEVICES=GPU-a7422633-d0d3-1a33-798c-0d95dc91a450]"
time=2024-10-15T16:55:14.588Z level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-10-15T16:55:14.588Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2024-10-15T16:55:14.589Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
INFO [main] starting c++ runner | tid="134069305147392" timestamp=1729011314
INFO [main] build info | build=10 commit="9794cea" tid="134069305147392" timestamp=1729011314
INFO [main] system info | n_threads=12 n_threads_batch=12 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="134069305147392" timestamp=1729011314 total_threads=24
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="23" port="44195" tid="134069305147392" timestamp=1729011314
llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 2
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
time=2024-10-15T16:55:14.840Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =  4437.80 MiB
time=2024-10-15T16:55:15.593Z level=DEBUG source=server.go:643 msg="model load progress 1.00"
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_cuda_host_malloc: failed to allocate 1024.00 MiB of pinned memory: no CUDA-capable device is detected
time=2024-10-15T16:55:15.843Z level=DEBUG source=server.go:646 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_kv_cache_init:        CPU KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
ggml_cuda_host_malloc: failed to allocate 2.02 MiB of pinned memory: no CUDA-capable device is detected
llama_new_context_with_model:        CPU  output buffer size =     2.02 MiB
ggml_cuda_host_malloc: failed to allocate 560.01 MiB of pinned memory: no CUDA-capable device is detected
llama_new_context_with_model:  CUDA_Host compute buffer size =   560.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
DEBUG [initialize] initializing slots | n_slots=4 tid="134069305147392" timestamp=1729011316
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=0 tid="134069305147392" timestamp=1729011316
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=1 tid="134069305147392" timestamp=1729011316
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=2 tid="134069305147392" timestamp=1729011316
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=3 tid="134069305147392" timestamp=1729011316
INFO [main] model loaded | tid="134069305147392" timestamp=1729011316
DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="134069305147392" timestamp=1729011316
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=0 tid="134069305147392" timestamp=1729011316
time=2024-10-15T16:55:16.346Z level=INFO source=server.go:637 msg="llama runner started in 1.76 seconds"
time=2024-10-15T16:55:16.346Z level=DEBUG source=sched.go:462 msg="finished setting up runner" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
[GIN] 2024/10/15 - 16:55:16 | 200 |  1.909902263s |       127.0.0.1 | POST     "/api/generate"
time=2024-10-15T16:55:16.347Z level=DEBUG source=sched.go:466 msg="context for request finished"
time=2024-10-15T16:55:16.347Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe duration=5m0s
time=2024-10-15T16:55:16.347Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe refCount=0
time=2024-10-15T16:55:31.161Z level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1 tid="134069305147392" timestamp=1729011331
time=2024-10-15T16:55:31.163Z level=DEBUG source=routes.go:1422 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nHello! How are you today?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=2 tid="134069305147392" timestamp=1729011331
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=3 tid="134069305147392" timestamp=1729011331
DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=17 slot_id=0 task_id=3 tid="134069305147392" timestamp=1729011331
DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=3 tid="134069305147392" timestamp=1729011331
DEBUG [print_timings] prompt eval time     =     967.38 ms /    17 tokens (   56.90 ms per token,    17.57 tokens per second) | n_prompt_tokens_processed=17 n_tokens_second=17.573257223900868 slot_id=0 t_prompt_processing=967.379 t_token=56.90464705882353 task_id=3 tid="134069305147392" timestamp=1729011337
DEBUG [print_timings] generation eval time =    5444.23 ms /    51 runs   (  106.75 ms per token,     9.37 tokens per second) | n_decoded=51 n_tokens_second=9.3677107500726 slot_id=0 t_token=106.74966666666667 t_token_generation=5444.233 task_id=3 tid="134069305147392" timestamp=1729011337
DEBUG [print_timings]           total time =    6411.61 ms | slot_id=0 t_prompt_processing=967.379 t_token_generation=5444.233 t_total=6411.612 task_id=3 tid="134069305147392" timestamp=1729011337
DEBUG [update_slots] slot released | n_cache_tokens=68 n_ctx=8192 n_past=67 n_system_tokens=0 slot_id=0 task_id=3 tid="134069305147392" timestamp=1729011337 truncated=false
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=51496 status=200 tid="134067313967104" timestamp=1729011337
[GIN] 2024/10/15 - 16:55:37 | 200 |  6.502089485s |       127.0.0.1 | POST     "/api/chat"
time=2024-10-15T16:55:37.618Z level=DEBUG source=sched.go:407 msg="context for request finished"
time=2024-10-15T16:55:37.618Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe duration=5m0s
time=2024-10-15T16:55:37.618Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe refCount=0
time=2024-10-15T17:00:37.618Z level=DEBUG source=sched.go:341 msg="timer expired, expiring to unload" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
time=2024-10-15T17:00:37.618Z level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
time=2024-10-15T17:00:37.618Z level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
time=2024-10-15T17:00:37.618Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.4 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="28.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:37.654Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:37.692Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:37.692Z level=DEBUG source=server.go:1097 msg="stopping llama server"
time=2024-10-15T17:00:37.692Z level=DEBUG source=server.go:1103 msg="waiting for llama server to exit"
time=2024-10-15T17:00:37.888Z level=DEBUG source=server.go:1107 msg="llama server stopped"
time=2024-10-15T17:00:37.888Z level=DEBUG source=sched.go:380 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
time=2024-10-15T17:00:37.943Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="28.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:37.981Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:38.003Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:38.193Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:38.232Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:38.251Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:38.442Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:38.481Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:38.501Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:38.693Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:38.730Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:38.750Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:38.943Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:38.978Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:38.998Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:39.192Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:39.237Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:39.273Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:39.443Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:39.488Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:39.512Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:39.693Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:39.726Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:39.744Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:39.943Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:39.975Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:40.007Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:40.193Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:40.229Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:40.262Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:40.443Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:40.495Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:40.519Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:40.693Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:40.726Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:40.745Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:40.943Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:40.978Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:41.013Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:41.193Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:41.228Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:41.248Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:41.442Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:41.479Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:41.500Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:41.693Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:41.730Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:41.749Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:41.943Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:41.977Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:41.997Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:42.192Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:42.237Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:42.261Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:42.443Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:42.479Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:42.499Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:42.692Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.07412869 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
time=2024-10-15T17:00:42.692Z level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
time=2024-10-15T17:00:42.692Z level=DEBUG source=sched.go:308 msg="ignoring unload event with no pending requests"
time=2024-10-15T17:00:42.693Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:42.731Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:42.752Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:42.942Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.323969175 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
time=2024-10-15T17:00:42.942Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:42.979Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:43.000Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:43.192Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.573848459 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe

I apologize for the very verbose output :) I also saved the full log (from when the container was started), let me know if you want me to paste something specific.

The logs clearly show there are issues with reading the GPU memory.

cuda driver library failed to get device context 800time=2024-10-15T16:55:14.496Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"

For reference, @gmacario and I are using a system with the following specs:

  • CPU: Intel Core i9-7920X
  • RAM: 32 GB
  • GPU: 2x Nvidia 1080 Ti
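
When the failure next occurs, it may also be worth checking whether the device is still visible inside the container at all (a quick diagnostic sketch; assumes the container is named ollama and was started with GPU access):

docker exec -it ollama nvidia-smi

If nvidia-smi fails inside the container while still working on the host, that would point at the container runtime or driver state rather than at Ollama itself.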
How are you today?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=2 tid="134069305147392" timestamp=1729011331 DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=3 tid="134069305147392" timestamp=1729011331 DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=17 slot_id=0 task_id=3 tid="134069305147392" timestamp=1729011331 DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=3 tid="134069305147392" timestamp=1729011331 DEBUG [print_timings] prompt eval time = 967.38 ms / 17 tokens ( 56.90 ms per token, 17.57 tokens per second) | n_prompt_tokens_processed=17 n_tokens_second=17.573257223900868 slot_id=0 t_prompt_processing=967.379 t_token=56.90464705882353 task_id=3 tid="134069305147392" timestamp=1729011337 DEBUG [print_timings] generation eval time = 5444.23 ms / 51 runs ( 106.75 ms per token, 9.37 tokens per second) | n_decoded=51 n_tokens_second=9.3677107500726 slot_id=0 t_token=106.74966666666667 t_token_generation=5444.233 task_id=3 tid="134069305147392" timestamp=1729011337 DEBUG [print_timings] total time = 6411.61 ms | slot_id=0 t_prompt_processing=967.379 t_token_generation=5444.233 t_total=6411.612 task_id=3 tid="134069305147392" timestamp=1729011337 DEBUG [update_slots] slot released | n_cache_tokens=68 n_ctx=8192 n_past=67 n_system_tokens=0 slot_id=0 task_id=3 tid="134069305147392" timestamp=1729011337 truncated=false DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=51496 status=200 tid="134067313967104" timestamp=1729011337 [GIN] 2024/10/15 - 16:55:37 | 200 | 6.502089485s | 127.0.0.1 | POST "/api/chat" time=2024-10-15T16:55:37.618Z level=DEBUG source=sched.go:407 msg="context for request finished" time=2024-10-15T16:55:37.618Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe duration=5m0s time=2024-10-15T16:55:37.618Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe refCount=0 time=2024-10-15T17:00:37.618Z level=DEBUG source=sched.go:341 msg="timer expired, expiring to unload" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe time=2024-10-15T17:00:37.618Z level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe time=2024-10-15T17:00:37.618Z level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe time=2024-10-15T17:00:37.618Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.4 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="28.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:37.654Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:37.692Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda 
driver library time=2024-10-15T17:00:37.692Z level=DEBUG source=server.go:1097 msg="stopping llama server" time=2024-10-15T17:00:37.692Z level=DEBUG source=server.go:1103 msg="waiting for llama server to exit" time=2024-10-15T17:00:37.888Z level=DEBUG source=server.go:1107 msg="llama server stopped" time=2024-10-15T17:00:37.888Z level=DEBUG source=sched.go:380 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe time=2024-10-15T17:00:37.943Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="28.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:37.981Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:38.003Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:38.193Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:38.232Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:38.251Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:38.442Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:38.481Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:38.501Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:38.693Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:38.730Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:38.750Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:38.943Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:38.978Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:38.998Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:39.192Z level=DEBUG source=gpu.go:359 msg="updating system memory data" 
before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:39.237Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:39.273Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:39.443Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:39.488Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:39.512Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:39.693Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:39.726Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:39.744Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:39.943Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:39.975Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:40.007Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:40.193Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:40.229Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:40.262Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:40.443Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:40.495Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:40.519Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:40.693Z level=DEBUG source=gpu.go:359 msg="updating system memory 
data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:40.726Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:40.745Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:40.943Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:40.978Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:41.013Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:41.193Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:41.228Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:41.248Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:41.442Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:41.479Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:41.500Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:41.693Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:41.730Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:41.749Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:41.943Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:41.977Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:41.997Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:42.192Z level=DEBUG source=gpu.go:359 msg="updating system 
memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:42.237Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:42.261Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:42.443Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:42.479Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:42.499Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:42.692Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.07412869 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe time=2024-10-15T17:00:42.692Z level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe time=2024-10-15T17:00:42.692Z level=DEBUG source=sched.go:308 msg="ignoring unload event with no pending requests" time=2024-10-15T17:00:42.693Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:42.731Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:42.752Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:42.942Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.323969175 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe time=2024-10-15T17:00:42.942Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:42.979Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:43.000Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:43.192Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.573848459 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe ``` I apologize for the _very_ verbose output :) I also saved the full log (from when the container was started), let me know if you want me to paste something specific. 
The logs clearly show there are issues with reading the GPU memory:

```text
cuda driver library failed to get device context 800
time=2024-10-15T16:55:14.496Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
```

For reference, @gmacario and I are using a system with the following specs:

- CPU: Intel Core i9-7920X
- RAM: 32 GB
- GPU: 2x Nvidia 1080 Ti
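A quick way to confirm this failure mode is to check whether the GPU is visible from inside the container at all. This is a minimal sketch, assuming the container is named `ollama` and `nvidia-smi` works on the host:

```bash
# Host view of the GPU: should list both 1080 Ti cards.
nvidia-smi

# Same check from inside the running container
# ("ollama" is an assumed container name; adjust to your setup).
docker exec -it ollama nvidia-smi
```

If `nvidia-smi` succeeds on the host but fails inside the container, the container itself has lost access to the GPU device nodes, which is consistent with the `no CUDA-capable device is detected` errors in the log above.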
Author
Owner

@dhiltgen commented on GitHub (Oct 15, 2024):

@davmacario you hit the `cuda driver library failed to get device context 800` error which is tracked via #6928

<!-- gh-comment-id:2414801508 --> @dhiltgen commented on GitHub (Oct 15, 2024): @davmacario you hit the `cuda driver library failed to get device context 800` error which is tracked via #6928
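
For anyone landing on this thread with the same symptom, two workarounds are commonly suggested for this class of failure. Neither is confirmed as the fix for this exact issue, so treat them as assumptions to test rather than the tracked root cause:

```bash
# Workaround 1 (host side): reload the nvidia_uvm kernel module, which can
# restore GPU visibility after suspend/resume or a driver hiccup.
sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm

# Workaround 2 (container side): pass the NVIDIA device nodes explicitly so
# the container keeps direct references to them. The /dev/nvidia* names are
# the usual defaults; verify with `ls /dev/nvidia*` on your host.
docker run -d --gpus=all \
  --device /dev/nvidiactl \
  --device /dev/nvidia-uvm \
  --device /dev/nvidia-uvm-tools \
  --device /dev/nvidia0 \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama
```

If the GPU still disappears after an idle period, restarting the container (as in the original report) remains the reliable recovery step.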