[GH-ISSUE #3930] GPU allocation lost after container idle period #64473

Closed
opened 2026-05-03 17:47:37 -05:00 by GiteaMirror · 14 comments
Owner

Originally created by @hl-hok on GitHub (Apr 26, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3930

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I'm experiencing an issue with Ollama where the Docker container fails to utilize the GPU unless I restart the container. This occurs when the container remains idle for an extended period (e.g., a day).

Initially, the GPU is configured correctly and allocated to the container. However, after not using the LLM for a while, the container only utilizes the CPU and ignores the available GPU resources.

Restarting the Docker container resolves the issue, and the GPU is allocated again. I've verified that my GPU configuration is correct, and the Ollama service is running normally.

Steps to reproduce:

  1. Run an LLM using Ollama in a Docker container with a correctly configured GPU (a sample launch command follows this list).
  2. Allow the container to remain idle for an extended period (e.g., a day).
  3. Attempt to use the LLM again.
  4. Observe that the container only utilizes the CPU and not the GPU.
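
For concreteness, a typical GPU-enabled launch looks like the sketch below (the container name, port mapping, and volume are illustrative rather than taken from this report, and the host is assumed to have the NVIDIA Container Toolkit installed):

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama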

Expected behavior:
The Docker container should continue to utilize the allocated GPU resources even after an extended idle period.

Environment:
Ollama version: 0.1.32
Docker version: 26.0.2
GPU driver version: CUDA 12.4
Kernel version: 6.5.0-27-generic

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.1.32

GiteaMirror added the docker, bug, nvidia labels 2026-05-03 17:47:38 -05:00
Author
Owner

@gaye746560359 commented on GitHub (Apr 29, 2024):

me too

Author
Owner

@dhiltgen commented on GitHub (May 1, 2024):

Can you share the server log after the idle period so we can see why it's failing to discover the GPU? It may be helpful to set `-e OLLAMA_DEBUG=1` on the container.
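
For example (a sketch reusing the stock launch command with debug logging enabled; the container name is illustrative):

docker rm -f ollama
docker run -d --gpus=all -e OLLAMA_DEBUG=1 -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker logs -f ollama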

Author
Owner

@hl-hok commented on GitHub (May 2, 2024):

Hi @dhiltgen, I've just run a new container and removed the old one. I restarted the container once again with the debug setting. I'll share the log here once it fails again.

Author
Owner

@yukaichao commented on GitHub (May 8, 2024):

https://blog.csdn.net/flipped_1121/article/details/137047698
You can try this.

Author
Owner

@sammcj commented on GitHub (May 8, 2024):

I've noticed this as well; I end up restarting my container every 8 hours or so to ensure it doesn't end up using the CPU.

I'll do the same and enable debug on my setup to try and catch it as well.
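
In the meantime, the periodic-restart workaround can be automated with cron (a sketch; it assumes the container is named ollama):

# crontab entry: restart the Ollama container every 8 hours
0 */8 * * * docker restart ollama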

Author
Owner

@brivad commented on GitHub (May 9, 2024):

I'm experiencing a similar issue to the one described in this thread, but my setup differs slightly as I'm not using Docker. I have Ollama version 0.1.32 installed via apt on an Ubuntu 22 system and running as a service (ollama.service). Unfortunately, neither restarting Ollama nor the Nvidia stack resolves the issue. The only workaround I've found is to restart the entire system to get the GPU detected again.

Author
Owner

@dhiltgen commented on GitHub (May 10, 2024):

We've recently added some troubleshooting steps for nvidia drivers https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#container-fails-to-run-on-nvidia-gpu which might be helpful in narrowing down your problem @brivad. If none of those solve it, can you share your server log showing attempts to load the model where it doesn't run on the GPU? I'm expecting there will be errors reported from the nvidia APIs we call that might help us understand what's going wrong.
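
One commonly cited step from that doc is reloading the NVIDIA uvm kernel module on the host, which some users report restores GPU discovery without a full system reboot (a sketch, not a guaranteed fix; verify against the linked doc and run it on the host, not inside the container):

sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm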

Author
Owner

@davmacario commented on GitHub (Oct 11, 2024):

I have the exact same issue

Author
Owner

@gmacario commented on GitHub (Oct 11, 2024):

> @dhiltgen closed this as completed on May 31

@dhiltgen just wondering whether this issue was closed because of some fixes, or is the workaround suggested by someone in this thread (i.e. restarting the container every 8h or so) the way to go.

I am also experiencing intermittent freezes of the latest release of the ollama/ollama image running on an up-to-date Ubuntu 24.04.1 LTS system:

gmacario@hw2482:~$ docker ps | grep ollama
01ca3aebb080   ollama/ollama                                    "/bin/ollama serve"      11 days ago   Up 5 hours             0.0.0.0:11435->11434/tcp, [::]:11435->11434/tcp   ollama
gmacario@hw2482:~$ docker pull ollama/ollama
Using default tag: latest
latest: Pulling from ollama/ollama
Digest: sha256:e458178cf2c114a22e1fe954dd9a92c785d1be686578a6c073a60cf259875470
Status: Image is up to date for ollama/ollama:latest
docker.io/ollama/ollama:latest
gmacario@hw2482:~$ docker --version
Docker version 27.3.1, build ce12230
gmacario@hw2482:~$ uname -a
Linux hw2482 6.8.0-45-generic #45-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 30 12:02:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
gmacario@hw2482:~$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
gmacario@hw2482:~$

I am happy to provide more details about my HW/SW configuration if it helps reopen this issue.

Author
Owner

@andi20002000 commented on GitHub (Oct 14, 2024):

I also have the same issue, also on Ubuntu 24. I can post all system info if the issue gets reopened.

Author
Owner

@dhiltgen commented on GitHub (Oct 14, 2024):

@davmacario @gmacario @andi20002000 if the troubleshooting steps didn't resolve the problem, please post server logs so we can see what's going wrong during GPU discovery.

Author
Owner

@recrudesce commented on GitHub (Oct 14, 2024):

I too have this issue. Ollama just seems to lose the GPU if it's been idle for a certain amount of time. Restarting the container fixes the issue.

It's finding the GPU fine on first startup, but after several hours of idling it switches to using the CPU. I've just restarted my container with the OLLAMA_DEBUG env variable set, and I'll see if I can get some logs for diagnostics.

Author
Owner

@davmacario commented on GitHub (Oct 15, 2024):

Here are the logs. I removed a bunch of stuff that happened earlier (basically a request that correctly triggered the use of the GPU).

...
time=2024-10-15T11:55:03.503Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-70f3b7ab-94d4-9950-8a8b-e8f2c853f85d name="NVIDIA GeForce GTX 1080 Ti" overhead="0 B" before.total="10.9 GiB" before.free="10.7 GiB" now.total="10.9 GiB" now.free="3.1 GiB" now.used="7.8 GiB"
time=2024-10-15T11:55:03.635Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-a7422633-d0d3-1a33-798c-0d95dc91a450 name="NVIDIA GeForce GTX 1080 Ti" overhead="0 B" before.total="10.9 GiB" before.free="10.8 GiB" now.total="10.9 GiB" now.free="2.9 GiB" now.used="8.1 GiB"
releasing cuda driver library
time=2024-10-15T11:55:03.635Z level=DEBUG source=server.go:1097 msg="stopping llama server"
time=2024-10-15T11:55:03.635Z level=DEBUG source=server.go:1103 msg="waiting for llama server to exit"
time=2024-10-15T11:55:03.729Z level=DEBUG source=server.go:1107 msg="llama server stopped"
time=2024-10-15T11:55:03.729Z level=DEBUG source=sched.go:380 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e
time=2024-10-15T11:55:03.886Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.0 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.4 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
time=2024-10-15T11:55:04.043Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-70f3b7ab-94d4-9950-8a8b-e8f2c853f85d name="NVIDIA GeForce GTX 1080 Ti" overhead="0 B" before.total="10.9 GiB" before.free="3.1 GiB" now.total="10.9 GiB" now.free="10.7 GiB" now.used="220.6 MiB"
time=2024-10-15T11:55:04.146Z level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-a7422633-d0d3-1a33-798c-0d95dc91a450 name="NVIDIA GeForce GTX 1080 Ti" overhead="0 B" before.total="10.9 GiB" before.free="2.9 GiB" now.total="10.9 GiB" now.free="10.8 GiB" now.used="144.9 MiB"
releasing cuda driver library
time=2024-10-15T11:55:04.146Z level=DEBUG source=sched.go:659 msg="gpu VRAM free memory converged after 0.76 seconds" model=/root/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e
time=2024-10-15T11:55:04.146Z level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e
time=2024-10-15T11:55:04.146Z level=DEBUG source=sched.go:308 msg="ignoring unload event with no pending requests"
[GIN] 2024/10/15 - 16:55:14 | 200 |      34.568µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/10/15 - 16:55:14 | 200 |    47.78086ms |       127.0.0.1 | POST     "/api/show"
time=2024-10-15T16:55:14.477Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.4 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.4 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T16:55:14.496Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T16:55:14.516Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T16:55:14.580Z level=DEBUG source=sched.go:224 msg="loading first model" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
time=2024-10-15T16:55:14.580Z level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[10.8 GiB]"
time=2024-10-15T16:55:14.581Z level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=GPU-a7422633-d0d3-1a33-798c-0d95dc91a450 parallel=4 available=11562909696 required="6.2 GiB"
time=2024-10-15T16:55:14.581Z level=INFO source=server.go:108 msg="system memory" total="31.2 GiB" free="29.3 GiB" free_swap="8.0 GiB"
time=2024-10-15T16:55:14.581Z level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[10.8 GiB]"
time=2024-10-15T16:55:14.582Z level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[10.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-10-15T16:55:14.583Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu/ollama_llama_server
time=2024-10-15T16:55:14.583Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx/ollama_llama_server
time=2024-10-15T16:55:14.583Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx2/ollama_llama_server
time=2024-10-15T16:55:14.583Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v11/ollama_llama_server
time=2024-10-15T16:55:14.583Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v12/ollama_llama_server
time=2024-10-15T16:55:14.583Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu/ollama_llama_server
time=2024-10-15T16:55:14.583Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx/ollama_llama_server
time=2024-10-15T16:55:14.583Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cpu_avx2/ollama_llama_server
time=2024-10-15T16:55:14.583Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v11/ollama_llama_server
time=2024-10-15T16:55:14.583Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/usr/lib/ollama/runners/cuda_v12/ollama_llama_server
time=2024-10-15T16:55:14.588Z level=INFO source=server.go:399 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12/ollama_llama_server --model /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --verbose --parallel 4 --port 44195"
time=2024-10-15T16:55:14.588Z level=DEBUG source=server.go:416 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/runners/cuda_v12:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 CUDA_VISIBLE_DEVICES=GPU-a7422633-d0d3-1a33-798c-0d95dc91a450]"
time=2024-10-15T16:55:14.588Z level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-10-15T16:55:14.588Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2024-10-15T16:55:14.589Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
INFO [main] starting c++ runner | tid="134069305147392" timestamp=1729011314
INFO [main] build info | build=10 commit="9794cea" tid="134069305147392" timestamp=1729011314
INFO [main] system info | n_threads=12 n_threads_batch=12 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="134069305147392" timestamp=1729011314 total_threads=24
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="23" port="44195" tid="134069305147392" timestamp=1729011314
llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 2
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
time=2024-10-15T16:55:14.840Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =  4437.80 MiB
time=2024-10-15T16:55:15.593Z level=DEBUG source=server.go:643 msg="model load progress 1.00"
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_cuda_host_malloc: failed to allocate 1024.00 MiB of pinned memory: no CUDA-capable device is detected
time=2024-10-15T16:55:15.843Z level=DEBUG source=server.go:646 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_kv_cache_init:        CPU KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
ggml_cuda_host_malloc: failed to allocate 2.02 MiB of pinned memory: no CUDA-capable device is detected
llama_new_context_with_model:        CPU  output buffer size =     2.02 MiB
ggml_cuda_host_malloc: failed to allocate 560.01 MiB of pinned memory: no CUDA-capable device is detected
llama_new_context_with_model:  CUDA_Host compute buffer size =   560.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
DEBUG [initialize] initializing slots | n_slots=4 tid="134069305147392" timestamp=1729011316
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=0 tid="134069305147392" timestamp=1729011316
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=1 tid="134069305147392" timestamp=1729011316
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=2 tid="134069305147392" timestamp=1729011316
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=3 tid="134069305147392" timestamp=1729011316
INFO [main] model loaded | tid="134069305147392" timestamp=1729011316
DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="134069305147392" timestamp=1729011316
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=0 tid="134069305147392" timestamp=1729011316
time=2024-10-15T16:55:16.346Z level=INFO source=server.go:637 msg="llama runner started in 1.76 seconds"
time=2024-10-15T16:55:16.346Z level=DEBUG source=sched.go:462 msg="finished setting up runner" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
[GIN] 2024/10/15 - 16:55:16 | 200 |  1.909902263s |       127.0.0.1 | POST     "/api/generate"
time=2024-10-15T16:55:16.347Z level=DEBUG source=sched.go:466 msg="context for request finished"
time=2024-10-15T16:55:16.347Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe duration=5m0s
time=2024-10-15T16:55:16.347Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe refCount=0
time=2024-10-15T16:55:31.161Z level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1 tid="134069305147392" timestamp=1729011331
time=2024-10-15T16:55:31.163Z level=DEBUG source=routes.go:1422 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nHello! How are you today?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=2 tid="134069305147392" timestamp=1729011331
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=3 tid="134069305147392" timestamp=1729011331
DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=17 slot_id=0 task_id=3 tid="134069305147392" timestamp=1729011331
DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=3 tid="134069305147392" timestamp=1729011331
DEBUG [print_timings] prompt eval time     =     967.38 ms /    17 tokens (   56.90 ms per token,    17.57 tokens per second) | n_prompt_tokens_processed=17 n_tokens_second=17.573257223900868 slot_id=0 t_prompt_processing=967.379 t_token=56.90464705882353 task_id=3 tid="134069305147392" timestamp=1729011337
DEBUG [print_timings] generation eval time =    5444.23 ms /    51 runs   (  106.75 ms per token,     9.37 tokens per second) | n_decoded=51 n_tokens_second=9.3677107500726 slot_id=0 t_token=106.74966666666667 t_token_generation=5444.233 task_id=3 tid="134069305147392" timestamp=1729011337
DEBUG [print_timings]           total time =    6411.61 ms | slot_id=0 t_prompt_processing=967.379 t_token_generation=5444.233 t_total=6411.612 task_id=3 tid="134069305147392" timestamp=1729011337
DEBUG [update_slots] slot released | n_cache_tokens=68 n_ctx=8192 n_past=67 n_system_tokens=0 slot_id=0 task_id=3 tid="134069305147392" timestamp=1729011337 truncated=false
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=51496 status=200 tid="134067313967104" timestamp=1729011337
[GIN] 2024/10/15 - 16:55:37 | 200 |  6.502089485s |       127.0.0.1 | POST     "/api/chat"
time=2024-10-15T16:55:37.618Z level=DEBUG source=sched.go:407 msg="context for request finished"
time=2024-10-15T16:55:37.618Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe duration=5m0s
time=2024-10-15T16:55:37.618Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe refCount=0
time=2024-10-15T17:00:37.618Z level=DEBUG source=sched.go:341 msg="timer expired, expiring to unload" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
time=2024-10-15T17:00:37.618Z level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
time=2024-10-15T17:00:37.618Z level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
time=2024-10-15T17:00:37.618Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.4 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="28.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:37.654Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:37.692Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:37.692Z level=DEBUG source=server.go:1097 msg="stopping llama server"
time=2024-10-15T17:00:37.692Z level=DEBUG source=server.go:1103 msg="waiting for llama server to exit"
time=2024-10-15T17:00:37.888Z level=DEBUG source=server.go:1107 msg="llama server stopped"
time=2024-10-15T17:00:37.888Z level=DEBUG source=sched.go:380 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
time=2024-10-15T17:00:37.943Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="28.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:37.981Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:38.003Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:38.193Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:38.232Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:38.251Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:38.442Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:38.481Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:38.501Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:38.693Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:38.730Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:38.750Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:38.943Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:38.978Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:38.998Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:39.192Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:39.237Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:39.273Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:39.443Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:39.488Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:39.512Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:39.693Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:39.726Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:39.744Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:39.943Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:39.975Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:40.007Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:40.193Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:40.229Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:40.262Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:40.443Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:40.495Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:40.519Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:40.693Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:40.726Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:40.745Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:40.943Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:40.978Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:41.013Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:41.193Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:41.228Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:41.248Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:41.442Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:41.479Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:41.500Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:41.693Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:41.730Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:41.749Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:41.943Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:41.977Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:41.997Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:42.192Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:42.237Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:42.261Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:42.443Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:42.479Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:42.499Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:42.692Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.07412869 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
time=2024-10-15T17:00:42.692Z level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
time=2024-10-15T17:00:42.692Z level=DEBUG source=sched.go:308 msg="ignoring unload event with no pending requests"
time=2024-10-15T17:00:42.693Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:42.731Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:42.752Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:42.942Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.323969175 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
time=2024-10-15T17:00:42.942Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB"
CUDA driver version: 12.2
cuda driver library failed to get device context 800time=2024-10-15T17:00:42.979Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 800time=2024-10-15T17:00:43.000Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
releasing cuda driver library
time=2024-10-15T17:00:43.192Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.573848459 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe

I apologize for the very verbose output :) I also saved the full log (from when the container was started), let me know if you want me to paste something specific.

The logs clearly show there are issues with reading the GPU memory.

cuda driver library failed to get device context 800time=2024-10-15T16:55:14.496Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"

For reference, @gmacario and I are using a system with the following specs:

  • CPU: Intel Core i9-7920X
  • RAM: 32 GB
  • GPU: 2x Nvidia 1080 Ti
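
When the failure next occurs, it may also be worth checking whether the device is still visible inside the container at all (a quick diagnostic sketch; assumes the container is named ollama and was started with GPU access):

docker exec -it ollama nvidia-smi

If nvidia-smi fails inside the container while still working on the host, that would point at the container runtime or driver state rather than at Ollama itself.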
How are you today?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=2 tid="134069305147392" timestamp=1729011331 DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=3 tid="134069305147392" timestamp=1729011331 DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=17 slot_id=0 task_id=3 tid="134069305147392" timestamp=1729011331 DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=3 tid="134069305147392" timestamp=1729011331 DEBUG [print_timings] prompt eval time = 967.38 ms / 17 tokens ( 56.90 ms per token, 17.57 tokens per second) | n_prompt_tokens_processed=17 n_tokens_second=17.573257223900868 slot_id=0 t_prompt_processing=967.379 t_token=56.90464705882353 task_id=3 tid="134069305147392" timestamp=1729011337 DEBUG [print_timings] generation eval time = 5444.23 ms / 51 runs ( 106.75 ms per token, 9.37 tokens per second) | n_decoded=51 n_tokens_second=9.3677107500726 slot_id=0 t_token=106.74966666666667 t_token_generation=5444.233 task_id=3 tid="134069305147392" timestamp=1729011337 DEBUG [print_timings] total time = 6411.61 ms | slot_id=0 t_prompt_processing=967.379 t_token_generation=5444.233 t_total=6411.612 task_id=3 tid="134069305147392" timestamp=1729011337 DEBUG [update_slots] slot released | n_cache_tokens=68 n_ctx=8192 n_past=67 n_system_tokens=0 slot_id=0 task_id=3 tid="134069305147392" timestamp=1729011337 truncated=false DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=51496 status=200 tid="134067313967104" timestamp=1729011337 [GIN] 2024/10/15 - 16:55:37 | 200 | 6.502089485s | 127.0.0.1 | POST "/api/chat" time=2024-10-15T16:55:37.618Z level=DEBUG source=sched.go:407 msg="context for request finished" time=2024-10-15T16:55:37.618Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe duration=5m0s time=2024-10-15T16:55:37.618Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe refCount=0 time=2024-10-15T17:00:37.618Z level=DEBUG source=sched.go:341 msg="timer expired, expiring to unload" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe time=2024-10-15T17:00:37.618Z level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe time=2024-10-15T17:00:37.618Z level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe time=2024-10-15T17:00:37.618Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.4 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="28.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:37.654Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:37.692Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda 
driver library time=2024-10-15T17:00:37.692Z level=DEBUG source=server.go:1097 msg="stopping llama server" time=2024-10-15T17:00:37.692Z level=DEBUG source=server.go:1103 msg="waiting for llama server to exit" time=2024-10-15T17:00:37.888Z level=DEBUG source=server.go:1107 msg="llama server stopped" time=2024-10-15T17:00:37.888Z level=DEBUG source=sched.go:380 msg="runner released" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe time=2024-10-15T17:00:37.943Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="28.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:37.981Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:38.003Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:38.193Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:38.232Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:38.251Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:38.442Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:38.481Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:38.501Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:38.693Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:38.730Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:38.750Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:38.943Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:38.978Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:38.998Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:39.192Z level=DEBUG source=gpu.go:359 msg="updating system memory data" 
before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:39.237Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:39.273Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:39.443Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:39.488Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:39.512Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:39.693Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.3 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:39.726Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:39.744Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:39.943Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.3 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:39.975Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:40.007Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:40.193Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:40.229Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:40.262Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:40.443Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:40.495Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:40.519Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:40.693Z level=DEBUG source=gpu.go:359 msg="updating system memory 
data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:40.726Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:40.745Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:40.943Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:40.978Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:41.013Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:41.193Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:41.228Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:41.248Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:41.442Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:41.479Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:41.500Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:41.693Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:41.730Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:41.749Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:41.943Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:41.977Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:41.997Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:42.192Z level=DEBUG source=gpu.go:359 msg="updating system 
memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:42.237Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:42.261Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:42.443Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:42.479Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:42.499Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:42.692Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.07412869 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe time=2024-10-15T17:00:42.692Z level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe time=2024-10-15T17:00:42.692Z level=DEBUG source=sched.go:308 msg="ignoring unload event with no pending requests" time=2024-10-15T17:00:42.693Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:42.731Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:42.752Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:42.942Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.323969175 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe time=2024-10-15T17:00:42.942Z level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.2 GiB" before.free="29.2 GiB" before.free_swap="8.0 GiB" now.total="31.2 GiB" now.free="29.2 GiB" now.free_swap="8.0 GiB" CUDA driver version: 12.2 cuda driver library failed to get device context 800time=2024-10-15T17:00:42.979Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" cuda driver library failed to get device context 800time=2024-10-15T17:00:43.000Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory" releasing cuda driver library time=2024-10-15T17:00:43.192Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.573848459 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe ``` I apologize for the _very_ verbose output :) I also saved the full log (from when the container was started), let me know if you want me to paste something specific. 
The logs clearly show there are issues with reading the GPU memory:

```text
cuda driver library failed to get device context 800
time=2024-10-15T16:55:14.496Z level=WARN source=gpu.go:400 msg="error looking up nvidia GPU memory"
```

For reference, @gmacario and I are using a system with the following specs:

- CPU: Intel Core i9-7920X
- RAM: 32 GB
- GPU: 2x Nvidia 1080 Ti
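A quick way to confirm this failure mode is to check whether the GPU is visible from inside the container at all. This is a minimal sketch, assuming the container is named `ollama` and `nvidia-smi` works on the host:

```bash
# Host view of the GPU: should list both 1080 Ti cards.
nvidia-smi

# Same check from inside the running container
# ("ollama" is an assumed container name; adjust to your setup).
docker exec -it ollama nvidia-smi
```

If `nvidia-smi` succeeds on the host but fails inside the container, the container itself has lost access to the GPU device nodes, which is consistent with the `no CUDA-capable device is detected` errors in the log above.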
Author
Owner

@dhiltgen commented on GitHub (Oct 15, 2024):

@davmacario you hit the `cuda driver library failed to get device context 800` error which is tracked via #6928

<!-- gh-comment-id:2414801508 --> @dhiltgen commented on GitHub (Oct 15, 2024): @davmacario you hit the `cuda driver library failed to get device context 800` error which is tracked via #6928
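
For anyone landing on this thread with the same symptom, two workarounds are commonly suggested for this class of failure. Neither is confirmed as the fix for this exact issue, so treat them as assumptions to test rather than the tracked root cause:

```bash
# Workaround 1 (host side): reload the nvidia_uvm kernel module, which can
# restore GPU visibility after suspend/resume or a driver hiccup.
sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm

# Workaround 2 (container side): pass the NVIDIA device nodes explicitly so
# the container keeps direct references to them. The /dev/nvidia* names are
# the usual defaults; verify with `ls /dev/nvidia*` on your host.
docker run -d --gpus=all \
  --device /dev/nvidiactl \
  --device /dev/nvidia-uvm \
  --device /dev/nvidia-uvm-tools \
  --device /dev/nvidia0 \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama
```

If the GPU still disappears after an idle period, restarting the container (as in the original report) remains the reliable recovery step.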