[GH-ISSUE #8271] llama runner process terminated: CUDA error #5289

Closed
opened 2026-04-12 16:28:04 -05:00 by GiteaMirror · 14 comments

Originally created by @iplayfast on GitHub (Dec 31, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/8271

What is the issue?

I'm making a little embedding example. The text is chunked and embedded fine, but on retrieval I get this weird message.
Then if I go to the CLI and try running the same model, I get the same message.

I'm pretty sure it has to do with having the embedding model in memory at the same time as llama3.2.

Also, at the time only the embedding model was in memory; llama3.2 had just been used and was supposed to still be loaded, but it is no longer there, which I think is a big clue.
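
A minimal sketch of that sequence (assuming the ollama Python package; the chunking and Chroma steps are elided):

```python
# Untested repro sketch: embed with one model, then immediately
# generate with a second model while the first is still resident.
import ollama

# Loads nomic-embed-text onto the GPU.
ollama.embed(model="nomic-embed-text", input="a chunk of text")

# While the embedding model is still loaded, switch to llama3.2;
# per the report, this is the call that fails with the CUDA error.
resp = ollama.generate(model="llama3.2", prompt="hello")
print(resp["response"])
```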

(venv-game) chris@FORGE:~/game$ ollama run llama3.2:latest
Error: llama runner process has terminated: CUDA error
(venv-game) chris@FORGE:~/game$ ollama ps
NAME                       ID              SIZE      PROCESSOR    UNTIL                   
nomic-embed-text:latest    0a109f422b47    849 MB    100% GPU     About a minute from now    
(venv-game) chris@FORGE:~/game$ ollama run llama3.2:latest
Error: llama runner process has terminated: CUDA error
(venv-game) chris@FORGE:~/game$ nvidia-smi
Tue Dec 31 00:33:00 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:01:00.0  On |                  Off |
|  0%   42C    P8             23W /  450W |    2269MiB /  24564MiB |     16%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2454      G   /usr/lib/xorg/Xorg                            653MiB |
|    0   N/A  N/A      3508      G   cinnamon                                       71MiB |
|    0   N/A  N/A      7665      G   ...seed-version=20241225-174432.450000        341MiB |
|    0   N/A  N/A    592670      G   ...erProcess --variations-seed-version         76MiB |
|    0   N/A  N/A   1334610      C   ...rs/cuda_v12_avx/ollama_llama_server        930MiB |
|    0   N/A  N/A   1663080      G   ...yOnDemand --variations-seed-version        105MiB |
|    0   N/A  N/A   3924325      G   ...erProcess --variations-seed-version         61MiB |
+-----------------------------------------------------------------------------------------+

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

ollama version is 0.5.4

GiteaMirror added the needs more info, bug labels 2026-04-12 16:28:04 -05:00

@rick-github commented on GitHub (Dec 31, 2024):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.


@iplayfast commented on GitHub (Dec 31, 2024):

The code that causes the problem:
https://github.com/iplayfast/OllamaPlayground/tree/main/WorkingExamples/rag2, and some server logs:

Dec 31 01:21:27 FORGE ollama[555324]: runtime.chanrecv1(0x0?, 0x0?)
Dec 31 01:21:27 FORGE ollama[555324]:         runtime/chan.go:489 +0x12 fp=0xc0000887b8 sp=0xc000088790 pc=0x5aa5db4d3952
Dec 31 01:21:27 FORGE ollama[555324]: runtime.unique_runtime_registerUniqueMapCleanup.func1(...)
Dec 31 01:21:27 FORGE ollama[555324]:         runtime/mgc.go:1781
Dec 31 01:21:27 FORGE ollama[555324]: runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
Dec 31 01:21:27 FORGE ollama[555324]:         runtime/mgc.go:1784 +0x2f fp=0xc0000887e0 sp=0xc0000887b8 pc=0x5aa5db4e6f0f
Dec 31 01:21:27 FORGE ollama[555324]: runtime.goexit({})
Dec 31 01:21:27 FORGE ollama[555324]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc0000887e8 sp=0xc0000887e0 pc=0x5aa5db540561
Dec 31 01:21:27 FORGE ollama[555324]: created by unique.runtime_registerUniqueMapCleanup in goroutine 1
Dec 31 01:21:27 FORGE ollama[555324]:         runtime/mgc.go:1779 +0x96
Dec 31 01:21:27 FORGE ollama[555324]: goroutine 21 gp=0xc0001068c0 m=nil [semacquire]:
Dec 31 01:21:27 FORGE ollama[555324]: runtime.gopark(0x0?, 0x0?, 0x0?, 0xe0?, 0x0?)
Dec 31 01:21:27 FORGE ollama[555324]:         runtime/proc.go:424 +0xce fp=0xc000089618 sp=0xc0000895f8 pc=0x5aa5db53892e
Dec 31 01:21:27 FORGE ollama[555324]: runtime.goparkunlock(...)
Dec 31 01:21:27 FORGE ollama[555324]:         runtime/proc.go:430
Dec 31 01:21:27 FORGE ollama[555324]: runtime.semacquire1(0xc00013e1b8, 0x0, 0x1, 0x0, 0x12)
Dec 31 01:21:27 FORGE ollama[555324]:         runtime/sema.go:178 +0x22c fp=0xc000089680 sp=0xc000089618 pc=0x5aa5db517c4c
Dec 31 01:21:27 FORGE ollama[555324]: sync.runtime_Semacquire(0x0?)
Dec 31 01:21:27 FORGE ollama[555324]:         runtime/sema.go:71 +0x25 fp=0xc0000896b8 sp=0xc000089680 pc=0x5aa5db539b65
Dec 31 01:21:27 FORGE ollama[555324]: sync.(*WaitGroup).Wait(0x0?)
Dec 31 01:21:27 FORGE ollama[555324]:         sync/waitgroup.go:118 +0x48 fp=0xc0000896e0 sp=0xc0000896b8 pc=0x5aa5db555e08
Dec 31 01:21:27 FORGE ollama[555324]: github.com/ollama/ollama/llama/runner.(*Server).run(0xc00013e1b0, {0x5aa5dbb77de0, 0xc00017e050})
Dec 31 01:21:27 FORGE ollama[555324]:         github.com/ollama/ollama/llama/runner/runner.go:315 +0x47 fp=0xc0000897b8 sp=0xc0000896e0 pc=0x5aa5db778487
Dec 31 01:21:27 FORGE ollama[555324]: github.com/ollama/ollama/llama/runner.Execute.gowrap2()
Dec 31 01:21:27 FORGE ollama[555324]:         github.com/ollama/ollama/llama/runner/runner.go:984 +0x28 fp=0xc0000897e0 sp=0xc0000897b8 pc=0x5aa5db77d628
Dec 31 01:21:27 FORGE ollama[555324]: runtime.goexit({})
Dec 31 01:21:27 FORGE ollama[555324]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc0000897e8 sp=0xc0000897e0 pc=0x5aa5db540561
Dec 31 01:21:27 FORGE ollama[555324]: created by github.com/ollama/ollama/llama/runner.Execute in goroutine 1
Dec 31 01:21:27 FORGE ollama[555324]:         github.com/ollama/ollama/llama/runner/runner.go:984 +0xde5
Dec 31 01:21:27 FORGE ollama[555324]: rax    0x0
Dec 31 01:21:27 FORGE ollama[555324]: rbx    0x18d90f
Dec 31 01:21:27 FORGE ollama[555324]: rcx    0x77de4889eb1c
Dec 31 01:21:27 FORGE ollama[555324]: rdx    0x6
Dec 31 01:21:27 FORGE ollama[555324]: rdi    0x18d90c
Dec 31 01:21:27 FORGE ollama[555324]: rsi    0x18d90f
Dec 31 01:21:27 FORGE ollama[555324]: rbp    0x77de009de330
Dec 31 01:21:27 FORGE ollama[555324]: rsp    0x77de009de2f0
Dec 31 01:21:27 FORGE ollama[555324]: r8     0x0
Dec 31 01:21:27 FORGE ollama[555324]: r9     0x0
Dec 31 01:21:27 FORGE ollama[555324]: r10    0x8
Dec 31 01:21:27 FORGE ollama[555324]: r11    0x246
Dec 31 01:21:27 FORGE ollama[555324]: r12    0x6
Dec 31 01:21:27 FORGE ollama[555324]: r13    0x60
Dec 31 01:21:27 FORGE ollama[555324]: r14    0x16
Dec 31 01:21:27 FORGE ollama[555324]: r15    0x77ddf0000c60
Dec 31 01:21:27 FORGE ollama[555324]: rip    0x77de4889eb1c
Dec 31 01:21:27 FORGE ollama[555324]: rflags 0x246
Dec 31 01:21:27 FORGE ollama[555324]: cs     0x33
Dec 31 01:21:27 FORGE ollama[555324]: fs     0x0
Dec 31 01:21:27 FORGE ollama[555324]: gs     0x0
Dec 31 01:21:27 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:27.107-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:27 FORGE ollama[555324]: time=2024-12-31T01:21:27.157-05:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: CUDA error: CUDA-capable device(s) is/are busy or unavailable\n  current device: 0, in function ggml_backend_cuda_device_get_memory at llama/ggml-cuda/ggml-cuda.cu:2769\n  cudaMemGetInfo(free, total)\nllama/ggml-cuda/ggml-cuda.cu:96: CUDA error"
Dec 31 01:21:27 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:21:27 | 500 |  297.582496ms |       127.0.0.1 | POST     "/api/generate"
Dec 31 01:21:27 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:27.358-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:27 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:27.608-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:27 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:27.858-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:28 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:28.108-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:28 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:28.358-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:28 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:28.608-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:28 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:28.858-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:29 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:29.108-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:29 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:29.358-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:29 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:29.608-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:29 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:29.858-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:30 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:30.108-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:30 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:30.358-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:30 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:30.608-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:30 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:30.858-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:31 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:31.108-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:31 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:31.358-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:31 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:31.607-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:31 FORGE ollama[555324]: time=2024-12-31T01:21:31.858-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.001746564 model=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
Dec 31 01:21:31 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:31.859-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:31 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:31.859-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:32 FORGE ollama[555324]: time=2024-12-31T01:21:32.108-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.251566216 model=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
Dec 31 01:21:32 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:32.108-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:32 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:32.110-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:32 FORGE ollama[555324]: time=2024-12-31T01:21:32.358-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.501623441 model=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
Dec 31 01:21:32 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:32.360-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:32 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:32.610-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:32 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:32.861-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:33 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:33.110-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:33 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:33.360-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:33 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:33.610-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:33 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:33.860-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:34 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:34.110-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:21:34 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:21:34.360-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"

@iplayfast commented on GitHub (Dec 31, 2024):

I wonder if this is similar to https://github.com/ollama/ollama/issues/6637
The order of operations I'm doing is:

  1. chunk a file and get embeddings for it: ollama.embed(model='nomic-embed-text')
  2. save the embeddings with Chroma
  3. get embeddings for the question (still using nomic-embed-text)
  4. retrieve the embeddings from step 2 that relate to step 3
  5. retrieve a response using llama3.2: response = ollama.generate(model=self.response_model_name, prompt=final_prompt, **ollama_kwargs)
  6. step 5 fails; retry with CPU only
  7. that fails as well

If I wait between model loads for the current model to exit Ollama, it works. It seems to be a two-models-loaded-at-the-same-time problem.
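
One way to script that wait, as a sketch: the Ollama API accepts a keep_alive parameter on embed and generate requests, and keep_alive=0 should make the server unload the embedding model as soon as the call returns (untested here):

```python
# Sketch of the workaround: drop the embedding model immediately
# (keep_alive=0) so only one model is ever resident on the GPU.
import ollama

emb = ollama.embed(
    model="nomic-embed-text",
    input="what does the document say about X?",
    keep_alive=0,  # unload nomic-embed-text once this call returns
)

# ... Chroma retrieval using emb would go here ...

resp = ollama.generate(model="llama3.2", prompt="final prompt with context")
```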


@iplayfast commented on GitHub (Dec 31, 2024):

Tried an experiment from the command line that can recreate it.
In one terminal:
ollama run qwq
In a second terminal:
ollama run llama3.2:latest
In a third terminal:

ollama run tinyllama
Error: llama runner process has terminated: CUDA error: CUDA-capable device(s) is/are busy or unavailable
  current device: 0, in function ggml_backend_cuda_device_get_memory at llama/ggml-cuda/ggml-cuda.cu:2769
  cudaMemGetInfo(free, total)
llama/ggml-cuda/ggml-cuda.cu:96: CUDA error
nvidia-smi
Tue Dec 31 01:59:25 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:01:00.0  On |                  Off |
| 30%   43C    P2             59W /  450W |    4848MiB /  24564MiB |      2%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2454      G   /usr/lib/xorg/Xorg                            627MiB |
|    0   N/A  N/A      3508      G   cinnamon                                       81MiB |
|    0   N/A  N/A      7665      G   ...seed-version=20241225-174432.450000        157MiB |
|    0   N/A  N/A    592670      G   ...erProcess --variations-seed-version         59MiB |
|    0   N/A  N/A   1663080      G   ...yOnDemand --variations-seed-version        140MiB |
|    0   N/A  N/A   1808798      C   ...rs/cuda_v12_avx/ollama_llama_server       3692MiB |
|    0   N/A  N/A   3924325      G   ...erProcess --variations-seed-version         61MiB |
+-----------------------------------------------------------------------------------------+
chris@FORGE:~$ 

Seems to happen pretty consistently.

Dec 31 01:58:13 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:13.871-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:14 FORGE ollama[555324]: time=2024-12-31T01:58:14.121-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.001576273 model=/usr/share/ollama/.ollama/models/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816
Dec 31 01:58:14 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:14.122-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:14 FORGE ollama[555324]: time=2024-12-31T01:58:14.371-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.251697771 model=/usr/share/ollama/.ollama/models/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816
Dec 31 01:58:14 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:14.372-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:14 FORGE ollama[555324]: time=2024-12-31T01:58:14.620-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.500690492 model=/usr/share/ollama/.ollama/models/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816
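
For completeness, a sketch that drives the same three-model experiment from Python instead of three terminals (assuming the ollama package; per the API docs, an empty prompt asks the server to load the model without generating):

```python
# Sketch: request three models concurrently so the scheduler has to
# load them at the same time, mirroring the three-terminal test.
from concurrent.futures import ThreadPoolExecutor
import ollama

MODELS = ["qwq", "llama3.2:latest", "tinyllama"]

def load(model: str) -> str:
    # An empty prompt makes /api/generate load the model only.
    ollama.generate(model=model, prompt="")
    return model

with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
    for name in pool.map(load, MODELS):
        print(f"loaded {name}")
```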


@rick-github commented on GitHub (Dec 31, 2024):

If you include full logs, it's easier to debug.


@iplayfast commented on GitHub (Dec 31, 2024):

Dec 31 01:55:57 FORGE ollama[555324]: created by github.com/ollama/ollama/llama/runner.Execute in goroutine 1
Dec 31 01:55:57 FORGE ollama[555324]:         github.com/ollama/ollama/llama/runner/runner.go:984 +0xde5
Dec 31 01:55:57 FORGE ollama[555324]: rax    0x0
Dec 31 01:55:57 FORGE ollama[555324]: rbx    0x1b834e
Dec 31 01:55:57 FORGE ollama[555324]: rcx    0x77f67e49eb1c
Dec 31 01:55:57 FORGE ollama[555324]: rdx    0x6
Dec 31 01:55:57 FORGE ollama[555324]: rdi    0x1b833f
Dec 31 01:55:57 FORGE ollama[555324]: rsi    0x1b834e
Dec 31 01:55:57 FORGE ollama[555324]: rbp    0x77f6365de330
Dec 31 01:55:57 FORGE ollama[555324]: rsp    0x77f6365de2f0
Dec 31 01:55:57 FORGE ollama[555324]: r8     0x0
Dec 31 01:55:57 FORGE ollama[555324]: r9     0x0
Dec 31 01:55:57 FORGE ollama[555324]: r10    0x8
Dec 31 01:55:57 FORGE ollama[555324]: r11    0x246
Dec 31 01:55:57 FORGE ollama[555324]: r12    0x6
Dec 31 01:55:57 FORGE ollama[555324]: r13    0x60
Dec 31 01:55:57 FORGE ollama[555324]: r14    0x16
Dec 31 01:55:57 FORGE ollama[555324]: r15    0x77f61c000c60
Dec 31 01:55:57 FORGE ollama[555324]: rip    0x77f67e49eb1c
Dec 31 01:55:57 FORGE ollama[555324]: rflags 0x246
Dec 31 01:55:57 FORGE ollama[555324]: cs     0x33
Dec 31 01:55:57 FORGE ollama[555324]: fs     0x0
Dec 31 01:55:57 FORGE ollama[555324]: gs     0x0
Dec 31 01:55:57 FORGE ollama[555324]: time=2024-12-31T01:55:57.309-05:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: CUDA error: CUDA-capable device(s) is/are busy or unavailable\n  current device: 0, in function ggml_backend_cuda_device_get_memory at llama/ggml-cuda/ggml-cuda.cu:2769\n  cudaMemGetInfo(free, total)\nllama/ggml-cuda/ggml-cuda.cu:96: CUDA error"
Dec 31 01:55:57 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:55:57 | 500 |  264.751885ms |       127.0.0.1 | POST     "/api/generate"
Dec 31 01:55:57 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:57.309-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:55:57 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:57.560-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:55:57 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:57.811-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:55:58 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:58.060-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:55:58 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:58.311-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:55:58 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:58.561-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:55:58 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:58.811-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:55:59 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:59.061-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:55:59 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:59.311-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:55:59 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:59.560-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:55:59 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:59.811-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:00 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:00.060-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:00 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:00.311-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:00 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:00.560-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:00 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:00.811-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:01 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:01.061-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:01 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:01.310-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:01 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:01.560-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:01 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:01.811-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:02 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:02.061-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:02 FORGE ollama[555324]: time=2024-12-31T01:56:02.311-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.001662227 model=/usr/share/ollama/.ollama/models/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816
Dec 31 01:56:02 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:02.311-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:02 FORGE ollama[555324]: time=2024-12-31T01:56:02.560-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.250827495 model=/usr/share/ollama/.ollama/models/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816
Dec 31 01:56:02 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:02.561-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:02 FORGE ollama[555324]: time=2024-12-31T01:56:02.810-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.500989833 model=/usr/share/ollama/.ollama/models/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816
Dec 31 01:57:45 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:57:45 | 200 |       36.95µs |       127.0.0.1 | HEAD     "/"
Dec 31 01:57:45 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:57:45 | 200 |      49.719µs |       127.0.0.1 | GET      "/api/ps"
Dec 31 01:57:48 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:57:48 | 200 |      23.165µs |       127.0.0.1 | HEAD     "/"
Dec 31 01:57:48 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:57:48 | 200 |   10.945101ms |       127.0.0.1 | POST     "/api/show"
Dec 31 01:57:48 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:57:48 | 200 |   13.489708ms |       127.0.0.1 | POST     "/api/generate"
Dec 31 01:57:52 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:57:52 | 200 |  336.628424ms |       127.0.0.1 | POST     "/api/chat"
Dec 31 01:58:00 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:58:00 | 200 |      25.838µs |       127.0.0.1 | HEAD     "/"
Dec 31 01:58:00 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:58:00 | 200 |   14.187616ms |       127.0.0.1 | POST     "/api/show"
Dec 31 01:58:00 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:00.345-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:00 FORGE ollama[555324]: time=2024-12-31T01:58:00.362-05:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 library=cuda total="23.6 GiB" available="2.1 GiB"
Dec 31 01:58:00 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:00.362-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:00 FORGE ollama[555324]: time=2024-12-31T01:58:00.890-05:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=23665770496 required="3.7 GiB"
Dec 31 01:58:00 FORGE ollama[555324]: time=2024-12-31T01:58:00.953-05:00 level=INFO source=server.go:104 msg="system memory" total="125.6 GiB" free="100.8 GiB" free_swap="2.0 GiB"
Dec 31 01:58:00 FORGE ollama[555324]: time=2024-12-31T01:58:00.953-05:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[22.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.7 GiB" memory.required.partial="3.7 GiB" memory.required.kv="896.0 MiB" memory.required.allocations="[3.7 GiB]" memory.weights.total="2.4 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="424.0 MiB" memory.graph.partial="570.7 MiB"
Dec 31 01:58:00 FORGE ollama[555324]: time=2024-12-31T01:58:00.953-05:00 level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/local/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 --threads 8 --parallel 4 --port 35619"
Dec 31 01:58:00 FORGE ollama[555324]: time=2024-12-31T01:58:00.954-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
Dec 31 01:58:00 FORGE ollama[555324]: time=2024-12-31T01:58:00.954-05:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
Dec 31 01:58:00 FORGE ollama[555324]: time=2024-12-31T01:58:00.954-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
Dec 31 01:58:00 FORGE ollama[555324]: time=2024-12-31T01:58:00.975-05:00 level=INFO source=runner.go:945 msg="starting go runner"
Dec 31 01:58:00 FORGE ollama[555324]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Dec 31 01:58:00 FORGE ollama[555324]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Dec 31 01:58:00 FORGE ollama[555324]: ggml_cuda_init: found 1 CUDA devices:
Dec 31 01:58:00 FORGE ollama[555324]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Dec 31 01:58:00 FORGE ollama[555324]: time=2024-12-31T01:58:00.980-05:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8
Dec 31 01:58:00 FORGE ollama[555324]: time=2024-12-31T01:58:00.980-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:35619"
Dec 31 01:58:01 FORGE ollama[555324]: llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22569 MiB free
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv   1:                               general.type str              = model
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv   5:                         general.size_label str              = 3B
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv   8:                          llama.block_count u32              = 28
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv  18:                          general.file_type u32              = 15
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - kv  29:               general.quantization_version u32              = 2
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - type  f32:   58 tensors
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - type q4_K:  168 tensors
Dec 31 01:58:01 FORGE ollama[555324]: llama_model_loader: - type q6_K:   29 tensors
Dec 31 01:58:01 FORGE ollama[555324]: time=2024-12-31T01:58:01.205-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_vocab: special tokens cache size = 256
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_vocab: token to piece cache size = 0.7999 MB
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: format           = GGUF V3 (latest)
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: arch             = llama
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: vocab type       = BPE
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: n_vocab          = 128256
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: n_merges         = 280147
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: vocab_only       = 0
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: n_ctx_train      = 131072
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: n_embd           = 3072
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: n_layer          = 28
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: n_head           = 24
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: n_head_kv        = 8
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: n_rot            = 128
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: n_swa            = 0
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: n_embd_head_k    = 128
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: n_embd_head_v    = 128
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: n_gqa            = 3
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: n_embd_k_gqa     = 1024
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: n_embd_v_gqa     = 1024
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: n_ff             = 8192
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: n_expert         = 0
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: n_expert_used    = 0
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: causal attn      = 1
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: pooling type     = 0
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: rope type        = 0
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: rope scaling     = linear
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: freq_base_train  = 500000.0
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: freq_scale_train = 1
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: rope_finetuned   = unknown
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: ssm_d_conv       = 0
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: ssm_d_inner      = 0
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: ssm_d_state      = 0
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: ssm_dt_rank      = 0
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: ssm_dt_b_c_rms   = 0
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: model type       = 3B
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: model ftype      = Q4_K - Medium
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: model params     = 3.21 B
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW)
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: LF token         = 128 'Ä'
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_print_meta: max token length = 256
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_tensors: offloading 28 repeating layers to GPU
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_tensors: offloading output layer to GPU
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_tensors: offloaded 29/29 layers to GPU
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_tensors:   CPU_Mapped model buffer size =   308.23 MiB
Dec 31 01:58:01 FORGE ollama[555324]: llm_load_tensors:        CUDA0 model buffer size =  1918.35 MiB
Dec 31 01:58:01 FORGE ollama[555324]: llama_new_context_with_model: n_seq_max     = 4
Dec 31 01:58:01 FORGE ollama[555324]: llama_new_context_with_model: n_ctx         = 8192
Dec 31 01:58:01 FORGE ollama[555324]: llama_new_context_with_model: n_ctx_per_seq = 2048
Dec 31 01:58:01 FORGE ollama[555324]: llama_new_context_with_model: n_batch       = 2048
Dec 31 01:58:01 FORGE ollama[555324]: llama_new_context_with_model: n_ubatch      = 512
Dec 31 01:58:01 FORGE ollama[555324]: llama_new_context_with_model: flash_attn    = 0
Dec 31 01:58:01 FORGE ollama[555324]: llama_new_context_with_model: freq_base     = 500000.0
Dec 31 01:58:01 FORGE ollama[555324]: llama_new_context_with_model: freq_scale    = 1
Dec 31 01:58:01 FORGE ollama[555324]: llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
Dec 31 01:58:01 FORGE ollama[555324]: llama_kv_cache_init:      CUDA0 KV buffer size =   896.00 MiB
Dec 31 01:58:01 FORGE ollama[555324]: llama_new_context_with_model: KV self size  =  896.00 MiB, K (f16):  448.00 MiB, V (f16):  448.00 MiB
Dec 31 01:58:01 FORGE ollama[555324]: llama_new_context_with_model:  CUDA_Host  output buffer size =     2.00 MiB
Dec 31 01:58:01 FORGE ollama[555324]: llama_new_context_with_model:      CUDA0 compute buffer size =   424.00 MiB
Dec 31 01:58:01 FORGE ollama[555324]: llama_new_context_with_model:  CUDA_Host compute buffer size =    22.01 MiB
Dec 31 01:58:01 FORGE ollama[555324]: llama_new_context_with_model: graph nodes  = 902
Dec 31 01:58:01 FORGE ollama[555324]: llama_new_context_with_model: graph splits = 2
Dec 31 01:58:01 FORGE ollama[555324]: time=2024-12-31T01:58:01.707-05:00 level=INFO source=server.go:594 msg="llama runner started in 0.75 seconds"
Dec 31 01:58:01 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:58:01 | 200 |  1.386181264s |       127.0.0.1 | POST     "/api/generate"
Dec 31 01:58:05 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:58:05 | 200 |  155.649527ms |       127.0.0.1 | POST     "/api/chat"
Dec 31 01:58:08 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:58:08 | 200 |      27.639µs |       127.0.0.1 | HEAD     "/"
Dec 31 01:58:08 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:58:08 | 200 |   12.444823ms |       127.0.0.1 | POST     "/api/show"
Dec 31 01:58:08 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:08.852-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:08 FORGE ollama[555324]: time=2024-12-31T01:58:08.866-05:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 library=cuda total="23.6 GiB" available="19.9 GiB"
Dec 31 01:58:08 FORGE ollama[555324]: time=2024-12-31T01:58:08.866-05:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816 gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=21390725120 required="1.7 GiB"
Dec 31 01:58:08 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:08.866-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:08 FORGE ollama[555324]: time=2024-12-31T01:58:08.866-05:00 level=INFO source=server.go:104 msg="system memory" total="125.6 GiB" free="100.4 GiB" free_swap="2.0 GiB"
Dec 31 01:58:08 FORGE ollama[555324]: time=2024-12-31T01:58:08.867-05:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=23 layers.offload=23 layers.split="" memory.available="[19.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="1.7 GiB" memory.required.partial="1.7 GiB" memory.required.kv="176.0 MiB" memory.required.allocations="[1.7 GiB]" memory.weights.total="696.1 MiB" memory.weights.repeating="644.8 MiB" memory.weights.nonrepeating="51.3 MiB" memory.graph.full="544.0 MiB" memory.graph.partial="546.3 MiB"
Dec 31 01:58:08 FORGE ollama[555324]: time=2024-12-31T01:58:08.867-05:00 level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/local/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /usr/share/ollama/.ollama/models/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816 --ctx-size 8192 --batch-size 512 --n-gpu-layers 23 --threads 8 --parallel 4 --port 41533"
Dec 31 01:58:08 FORGE ollama[555324]: time=2024-12-31T01:58:08.868-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=2
Dec 31 01:58:08 FORGE ollama[555324]: time=2024-12-31T01:58:08.868-05:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
Dec 31 01:58:08 FORGE ollama[555324]: time=2024-12-31T01:58:08.868-05:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
Dec 31 01:58:08 FORGE ollama[555324]: time=2024-12-31T01:58:08.900-05:00 level=INFO source=runner.go:945 msg="starting go runner"
Dec 31 01:58:08 FORGE ollama[555324]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Dec 31 01:58:08 FORGE ollama[555324]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Dec 31 01:58:08 FORGE ollama[555324]: ggml_cuda_init: found 1 CUDA devices:
Dec 31 01:58:08 FORGE ollama[555324]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Dec 31 01:58:08 FORGE ollama[555324]: time=2024-12-31T01:58:08.904-05:00 level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8
Dec 31 01:58:08 FORGE ollama[555324]: time=2024-12-31T01:58:08.904-05:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:41533"
Dec 31 01:58:08 FORGE ollama[555324]: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Dec 31 01:58:08 FORGE ollama[555324]:   current device: 0, in function ggml_backend_cuda_device_get_memory at llama/ggml-cuda/ggml-cuda.cu:2769
Dec 31 01:58:08 FORGE ollama[555324]:   cudaMemGetInfo(free, total)
Dec 31 01:58:08 FORGE ollama[555324]: llama/ggml-cuda/ggml-cuda.cu:96: CUDA error
Dec 31 01:58:08 FORGE ollama[1806889]: [New LWP 1806887]
Dec 31 01:58:08 FORGE ollama[1806889]: [New LWP 1806886]
Dec 31 01:58:08 FORGE ollama[1806889]: [New LWP 1806885]
Dec 31 01:58:08 FORGE ollama[1806889]: [New LWP 1806884]
Dec 31 01:58:08 FORGE ollama[1806889]: [New LWP 1806883]
Dec 31 01:58:08 FORGE ollama[1806889]: [New LWP 1806882]
Dec 31 01:58:09 FORGE ollama[1806889]: [Thread debugging using libthread_db enabled]
Dec 31 01:58:09 FORGE ollama[1806889]: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Dec 31 01:58:09 FORGE ollama[1806889]: 0x00005b998528deae in ?? ()
Dec 31 01:58:09 FORGE ollama[1806889]: #0  0x00005b998528deae in ?? ()
Dec 31 01:58:09 FORGE ollama[1806889]: #1  0x00005b998528de45 in ?? ()
Dec 31 01:58:09 FORGE ollama[1806889]: #2  0x00005b9985b519f8 in ?? ()
Dec 31 01:58:09 FORGE ollama[1806889]: #3  0x0000000000000001 in ?? ()
Dec 31 01:58:09 FORGE ollama[1806889]: #4  0x0000000000000024 in ?? ()
Dec 31 01:58:09 FORGE ollama[1806889]: #5  0x00007ffce45d994d in ?? ()
Dec 31 01:58:09 FORGE ollama[1806889]: #6  0x00005b99afe64e30 in ?? ()
Dec 31 01:58:09 FORGE ollama[1806889]: #7  0x00005b99afae5010 in ?? ()
Dec 31 01:58:09 FORGE ollama[1806889]: #8  0x0000000000000007 in ?? ()
Dec 31 01:58:09 FORGE ollama[1806889]: #9  0x00007ffce45d960c in ?? ()
Dec 31 01:58:09 FORGE ollama[1806889]: #10 0x00007ffce45d9c48 in ?? ()
Dec 31 01:58:09 FORGE ollama[1806889]: #11 0x00005b99852480d2 in ?? ()
Dec 31 01:58:09 FORGE ollama[1806889]: #12 0x0000040000000400 in ?? ()
Dec 31 01:58:09 FORGE ollama[1806889]: #13 0x00007ffce45d960c in ?? ()
Dec 31 01:58:09 FORGE ollama[1806889]: #14 0x00007bcd4347c658 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep::_S_empty_rep_storage () from /lib/x86_64-linux-gnu/libstdc++.so.6
Dec 31 01:58:09 FORGE ollama[1806889]: #15 0x00007bcd4347c658 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep::_S_empty_rep_storage () from /lib/x86_64-linux-gnu/libstdc++.so.6
Dec 31 01:58:09 FORGE ollama[1806889]: #16 0x0000000000000000 in ?? ()
Dec 31 01:58:09 FORGE ollama[1806889]: [Inferior 1 (process 1806881) detached]
Dec 31 01:58:09 FORGE ollama[555324]: SIGABRT: abort
Dec 31 01:58:09 FORGE ollama[555324]: PC=0x7bcd42e9eb1c m=3 sigcode=18446744073709551610
Dec 31 01:58:09 FORGE ollama[555324]: signal arrived during cgo execution
Dec 31 01:58:09 FORGE ollama[555324]: goroutine 20 gp=0xc000106700 m=3 mp=0xc000092e08 [syscall]:
Dec 31 01:58:09 FORGE ollama[555324]: runtime.cgocall(0x5b99854c8970, 0xc000186b78)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/cgocall.go:167 +0x4b fp=0xc000186b50 sp=0xc000186b18 pc=0x5b998527cb2b
Dec 31 01:58:09 FORGE ollama[555324]: github.com/ollama/ollama/llama._Cfunc_llama_load_model_from_file(0x7bcce4000be0, {0x0, 0x17, 0x1, 0x0, 0x0, 0x0, 0x5b99854c8380, 0xc00018a000, 0x0, ...})
Dec 31 01:58:09 FORGE ollama[555324]:         _cgo_gotypes.go:700 +0x50 fp=0xc000186b78 sp=0xc000186b50 pc=0x5b9985327410
Dec 31 01:58:09 FORGE ollama[555324]: github.com/ollama/ollama/llama.LoadModelFromFile.func1({0x7ffce45dab0b?, 0x0?}, {0x0, 0x17, 0x1, 0x0, 0x0, 0x0, 0x5b99854c8380, 0xc00018a000, ...})
Dec 31 01:58:09 FORGE ollama[555324]:         github.com/ollama/ollama/llama/llama.go:311 +0x127 fp=0xc000186c78 sp=0xc000186b78 pc=0x5b998532a027
Dec 31 01:58:09 FORGE ollama[555324]: github.com/ollama/ollama/llama.LoadModelFromFile({0x7ffce45dab0b, 0x6e}, {0x17, 0x0, 0x1, 0x0, {0x0, 0x0, 0x0}, 0xc0001161b0, ...})
Dec 31 01:58:09 FORGE ollama[555324]:         github.com/ollama/ollama/llama/llama.go:311 +0x2d6 fp=0xc000186dc8 sp=0xc000186c78 pc=0x5b9985329d16
Dec 31 01:58:09 FORGE ollama[555324]: github.com/ollama/ollama/llama/runner.(*Server).loadModel(0xc00013e1b0, {0x17, 0x0, 0x1, 0x0, {0x0, 0x0, 0x0}, 0xc0001161b0, 0x0}, ...)
Dec 31 01:58:09 FORGE ollama[555324]:         github.com/ollama/ollama/llama/runner/runner.go:859 +0xc5 fp=0xc000186f10 sp=0xc000186dc8 pc=0x5b99854c5de5
Dec 31 01:58:09 FORGE ollama[555324]: github.com/ollama/ollama/llama/runner.Execute.gowrap1()
Dec 31 01:58:09 FORGE ollama[555324]:         github.com/ollama/ollama/llama/runner/runner.go:979 +0xda fp=0xc000186fe0 sp=0xc000186f10 pc=0x5b99854c773a
Dec 31 01:58:09 FORGE ollama[555324]: runtime.goexit({})
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc000186fe8 sp=0xc000186fe0 pc=0x5b998528a561
Dec 31 01:58:09 FORGE ollama[555324]: created by github.com/ollama/ollama/llama/runner.Execute in goroutine 1
Dec 31 01:58:09 FORGE ollama[555324]:         github.com/ollama/ollama/llama/runner/runner.go:979 +0xd0d
Dec 31 01:58:09 FORGE ollama[555324]: goroutine 1 gp=0xc0000061c0 m=nil [IO wait]:
Dec 31 01:58:09 FORGE ollama[555324]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/proc.go:424 +0xce fp=0xc00002d7b0 sp=0xc00002d790 pc=0x5b998528292e
Dec 31 01:58:09 FORGE ollama[555324]: runtime.netpollblock(0x10?, 0x8521b186?, 0x99?)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/netpoll.go:575 +0xf7 fp=0xc00002d7e8 sp=0xc00002d7b0 pc=0x5b9985247697
Dec 31 01:58:09 FORGE ollama[555324]: internal/poll.runtime_pollWait(0x7bcd6858cec0, 0x72)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/netpoll.go:351 +0x85 fp=0xc00002d808 sp=0xc00002d7e8 pc=0x5b9985281c25
Dec 31 01:58:09 FORGE ollama[555324]: internal/poll.(*pollDesc).wait(0xc000176100?, 0x10?, 0x0)
Dec 31 01:58:09 FORGE ollama[555324]:         internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00002d830 sp=0xc00002d808 pc=0x5b99852d7a67
Dec 31 01:58:09 FORGE ollama[555324]: internal/poll.(*pollDesc).waitRead(...)
Dec 31 01:58:09 FORGE ollama[555324]:         internal/poll/fd_poll_runtime.go:89
Dec 31 01:58:09 FORGE ollama[555324]: internal/poll.(*FD).Accept(0xc000176100)
Dec 31 01:58:09 FORGE ollama[555324]:         internal/poll/fd_unix.go:620 +0x295 fp=0xc00002d8d8 sp=0xc00002d830 pc=0x5b99852d8fd5
Dec 31 01:58:09 FORGE ollama[555324]: net.(*netFD).accept(0xc000176100)
Dec 31 01:58:09 FORGE ollama[555324]:         net/fd_unix.go:172 +0x29 fp=0xc00002d990 sp=0xc00002d8d8 pc=0x5b9985351969
Dec 31 01:58:09 FORGE ollama[555324]: net.(*TCPListener).accept(0xc0001266c0)
Dec 31 01:58:09 FORGE ollama[555324]:         net/tcpsock_posix.go:159 +0x1e fp=0xc00002d9e0 sp=0xc00002d990 pc=0x5b9985361fbe
Dec 31 01:58:09 FORGE ollama[555324]: net.(*TCPListener).Accept(0xc0001266c0)
Dec 31 01:58:09 FORGE ollama[555324]:         net/tcpsock.go:372 +0x30 fp=0xc00002da10 sp=0xc00002d9e0 pc=0x5b99853612f0
Dec 31 01:58:09 FORGE ollama[555324]: net/http.(*onceCloseListener).Accept(0x5b99858c1d38?)
Dec 31 01:58:09 FORGE ollama[555324]:         <autogenerated>:1 +0x24 fp=0xc00002da28 sp=0xc00002da10 pc=0x5b998549fec4
Dec 31 01:58:09 FORGE ollama[555324]: net/http.(*Server).Serve(0xc0001744b0, {0x5b99858c17f8, 0xc0001266c0})
Dec 31 01:58:09 FORGE ollama[555324]:         net/http/server.go:3330 +0x30c fp=0xc00002db58 sp=0xc00002da28 pc=0x5b9985491c0c
Dec 31 01:58:09 FORGE ollama[555324]: github.com/ollama/ollama/llama/runner.Execute({0xc00012a010?, 0x5b998528a1bc?, 0x0?})
Dec 31 01:58:09 FORGE ollama[555324]:         github.com/ollama/ollama/llama/runner/runner.go:1005 +0x11a9 fp=0xc00002def8 sp=0xc00002db58 pc=0x5b99854c7309
Dec 31 01:58:09 FORGE ollama[555324]: main.main()
Dec 31 01:58:09 FORGE ollama[555324]:         github.com/ollama/ollama/cmd/runner/main.go:11 +0x54 fp=0xc00002df50 sp=0xc00002def8 pc=0x5b99854c8294
Dec 31 01:58:09 FORGE ollama[555324]: runtime.main()
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/proc.go:272 +0x29d fp=0xc00002dfe0 sp=0xc00002df50 pc=0x5b998524ec7d
Dec 31 01:58:09 FORGE ollama[555324]: runtime.goexit({})
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc00002dfe8 sp=0xc00002dfe0 pc=0x5b998528a561
Dec 31 01:58:09 FORGE ollama[555324]: goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]:
Dec 31 01:58:09 FORGE ollama[555324]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/proc.go:424 +0xce fp=0xc00008cfa8 sp=0xc00008cf88 pc=0x5b998528292e
Dec 31 01:58:09 FORGE ollama[555324]: runtime.goparkunlock(...)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/proc.go:430
Dec 31 01:58:09 FORGE ollama[555324]: runtime.forcegchelper()
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/proc.go:337 +0xb8 fp=0xc00008cfe0 sp=0xc00008cfa8 pc=0x5b998524efb8
Dec 31 01:58:09 FORGE ollama[555324]: runtime.goexit({})
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc00008cfe8 sp=0xc00008cfe0 pc=0x5b998528a561
Dec 31 01:58:09 FORGE ollama[555324]: created by runtime.init.7 in goroutine 1
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/proc.go:325 +0x1a
Dec 31 01:58:09 FORGE ollama[555324]: goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]:
Dec 31 01:58:09 FORGE ollama[555324]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/proc.go:424 +0xce fp=0xc00008d780 sp=0xc00008d760 pc=0x5b998528292e
Dec 31 01:58:09 FORGE ollama[555324]: runtime.goparkunlock(...)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/proc.go:430
Dec 31 01:58:09 FORGE ollama[555324]: runtime.bgsweep(0xc0000ba000)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/mgcsweep.go:277 +0x94 fp=0xc00008d7c8 sp=0xc00008d780 pc=0x5b99852397f4
Dec 31 01:58:09 FORGE ollama[555324]: runtime.gcenable.gowrap1()
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/mgc.go:204 +0x25 fp=0xc00008d7e0 sp=0xc00008d7c8 pc=0x5b998522e0a5
Dec 31 01:58:09 FORGE ollama[555324]: runtime.goexit({})
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc00008d7e8 sp=0xc00008d7e0 pc=0x5b998528a561
Dec 31 01:58:09 FORGE ollama[555324]: created by runtime.gcenable in goroutine 1
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/mgc.go:204 +0x66
Dec 31 01:58:09 FORGE ollama[555324]: goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]:
Dec 31 01:58:09 FORGE ollama[555324]: runtime.gopark(0xc0000ba000?, 0x5b99857a2e60?, 0x1?, 0x0?, 0xc000007340?)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/proc.go:424 +0xce fp=0xc00008df78 sp=0xc00008df58 pc=0x5b998528292e
Dec 31 01:58:09 FORGE ollama[555324]: runtime.goparkunlock(...)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/proc.go:430
Dec 31 01:58:09 FORGE ollama[555324]: runtime.(*scavengerState).park(0x5b9985aad060)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/mgcscavenge.go:425 +0x49 fp=0xc00008dfa8 sp=0xc00008df78 pc=0x5b9985237229
Dec 31 01:58:09 FORGE ollama[555324]: runtime.bgscavenge(0xc0000ba000)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/mgcscavenge.go:653 +0x3c fp=0xc00008dfc8 sp=0xc00008dfa8 pc=0x5b998523779c
Dec 31 01:58:09 FORGE ollama[555324]: runtime.gcenable.gowrap2()
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/mgc.go:205 +0x25 fp=0xc00008dfe0 sp=0xc00008dfc8 pc=0x5b998522e045
Dec 31 01:58:09 FORGE ollama[555324]: runtime.goexit({})
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc00008dfe8 sp=0xc00008dfe0 pc=0x5b998528a561
Dec 31 01:58:09 FORGE ollama[555324]: created by runtime.gcenable in goroutine 1
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/mgc.go:205 +0xa5
Dec 31 01:58:09 FORGE ollama[555324]: goroutine 18 gp=0xc000106380 m=nil [finalizer wait]:
Dec 31 01:58:09 FORGE ollama[555324]: runtime.gopark(0xc00008c648?, 0x5b99852245a5?, 0xb0?, 0x1?, 0xc0000061c0?)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/proc.go:424 +0xce fp=0xc00008c620 sp=0xc00008c600 pc=0x5b998528292e
Dec 31 01:58:09 FORGE ollama[555324]: runtime.runfinq()
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/mfinal.go:193 +0x107 fp=0xc00008c7e0 sp=0xc00008c620 pc=0x5b998522d127
Dec 31 01:58:09 FORGE ollama[555324]: runtime.goexit({})
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc00008c7e8 sp=0xc00008c7e0 pc=0x5b998528a561
Dec 31 01:58:09 FORGE ollama[555324]: created by runtime.createfing in goroutine 1
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/mfinal.go:163 +0x3d
Dec 31 01:58:09 FORGE ollama[555324]: goroutine 19 gp=0xc000106540 m=nil [chan receive]:
Dec 31 01:58:09 FORGE ollama[555324]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/proc.go:424 +0xce fp=0xc000088718 sp=0xc0000886f8 pc=0x5b998528292e
Dec 31 01:58:09 FORGE ollama[555324]: runtime.chanrecv(0xc0001000e0, 0x0, 0x1)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/chan.go:639 +0x41c fp=0xc000088790 sp=0xc000088718 pc=0x5b998521dd7c
Dec 31 01:58:09 FORGE ollama[555324]: runtime.chanrecv1(0x0?, 0x0?)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/chan.go:489 +0x12 fp=0xc0000887b8 sp=0xc000088790 pc=0x5b998521d952
Dec 31 01:58:09 FORGE ollama[555324]: runtime.unique_runtime_registerUniqueMapCleanup.func1(...)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/mgc.go:1781
Dec 31 01:58:09 FORGE ollama[555324]: runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/mgc.go:1784 +0x2f fp=0xc0000887e0 sp=0xc0000887b8 pc=0x5b9985230f0f
Dec 31 01:58:09 FORGE ollama[555324]: runtime.goexit({})
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc0000887e8 sp=0xc0000887e0 pc=0x5b998528a561
Dec 31 01:58:09 FORGE ollama[555324]: created by unique.runtime_registerUniqueMapCleanup in goroutine 1
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/mgc.go:1779 +0x96
Dec 31 01:58:09 FORGE ollama[555324]: goroutine 21 gp=0xc0001068c0 m=nil [semacquire]:
Dec 31 01:58:09 FORGE ollama[555324]: runtime.gopark(0x0?, 0x0?, 0xc0?, 0xe0?, 0x0?)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/proc.go:424 +0xce fp=0xc000089618 sp=0xc0000895f8 pc=0x5b998528292e
Dec 31 01:58:09 FORGE ollama[555324]: runtime.goparkunlock(...)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/proc.go:430
Dec 31 01:58:09 FORGE ollama[555324]: runtime.semacquire1(0xc00013e1b8, 0x0, 0x1, 0x0, 0x12)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/sema.go:178 +0x22c fp=0xc000089680 sp=0xc000089618 pc=0x5b9985261c4c
Dec 31 01:58:09 FORGE ollama[555324]: sync.runtime_Semacquire(0x0?)
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/sema.go:71 +0x25 fp=0xc0000896b8 sp=0xc000089680 pc=0x5b9985283b65
Dec 31 01:58:09 FORGE ollama[555324]: sync.(*WaitGroup).Wait(0x0?)
Dec 31 01:58:09 FORGE ollama[555324]:         sync/waitgroup.go:118 +0x48 fp=0xc0000896e0 sp=0xc0000896b8 pc=0x5b998529fe08
Dec 31 01:58:09 FORGE ollama[555324]: github.com/ollama/ollama/llama/runner.(*Server).run(0xc00013e1b0, {0x5b99858c1de0, 0xc00017e050})
Dec 31 01:58:09 FORGE ollama[555324]:         github.com/ollama/ollama/llama/runner/runner.go:315 +0x47 fp=0xc0000897b8 sp=0xc0000896e0 pc=0x5b99854c2487
Dec 31 01:58:09 FORGE ollama[555324]: github.com/ollama/ollama/llama/runner.Execute.gowrap2()
Dec 31 01:58:09 FORGE ollama[555324]:         github.com/ollama/ollama/llama/runner/runner.go:984 +0x28 fp=0xc0000897e0 sp=0xc0000897b8 pc=0x5b99854c7628
Dec 31 01:58:09 FORGE ollama[555324]: runtime.goexit({})
Dec 31 01:58:09 FORGE ollama[555324]:         runtime/asm_amd64.s:1700 +0x1 fp=0xc0000897e8 sp=0xc0000897e0 pc=0x5b998528a561
Dec 31 01:58:09 FORGE ollama[555324]: created by github.com/ollama/ollama/llama/runner.Execute in goroutine 1
Dec 31 01:58:09 FORGE ollama[555324]:         github.com/ollama/ollama/llama/runner/runner.go:984 +0xde5
Dec 31 01:58:09 FORGE ollama[555324]: rax    0x0
Dec 31 01:58:09 FORGE ollama[555324]: rbx    0x1b9223
Dec 31 01:58:09 FORGE ollama[555324]: rcx    0x7bcd42e9eb1c
Dec 31 01:58:09 FORGE ollama[555324]: rdx    0x6
Dec 31 01:58:09 FORGE ollama[555324]: rdi    0x1b9221
Dec 31 01:58:09 FORGE ollama[555324]: rsi    0x1b9223
Dec 31 01:58:09 FORGE ollama[555324]: rbp    0x7bccfb9de330
Dec 31 01:58:09 FORGE ollama[555324]: rsp    0x7bccfb9de2f0
Dec 31 01:58:09 FORGE ollama[555324]: r8     0x0
Dec 31 01:58:09 FORGE ollama[555324]: r9     0x0
Dec 31 01:58:09 FORGE ollama[555324]: r10    0x8
Dec 31 01:58:09 FORGE ollama[555324]: r11    0x246
Dec 31 01:58:09 FORGE ollama[555324]: r12    0x6
Dec 31 01:58:09 FORGE ollama[555324]: r13    0x60
Dec 31 01:58:09 FORGE ollama[555324]: r14    0x16
Dec 31 01:58:09 FORGE ollama[555324]: r15    0x7bcce4000c60
Dec 31 01:58:09 FORGE ollama[555324]: rip    0x7bcd42e9eb1c
Dec 31 01:58:09 FORGE ollama[555324]: rflags 0x246
Dec 31 01:58:09 FORGE ollama[555324]: cs     0x33
Dec 31 01:58:09 FORGE ollama[555324]: fs     0x0
Dec 31 01:58:09 FORGE ollama[555324]: gs     0x0
Dec 31 01:58:09 FORGE ollama[555324]: time=2024-12-31T01:58:09.119-05:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: CUDA error: CUDA-capable device(s) is/are busy or unavailable\n  current device: 0, in function ggml_backend_cuda_device_get_memory at llama/ggml-cuda/ggml-cuda.cu:2769\n  cudaMemGetInfo(free, total)\nllama/ggml-cuda/ggml-cuda.cu:96: CUDA error"
Dec 31 01:58:09 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:58:09 | 500 |  274.433471ms |       127.0.0.1 | POST     "/api/generate"
Dec 31 01:58:09 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:09.120-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:09 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:09.372-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:09 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:09.621-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:09 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:09.872-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:10 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:10.122-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:10 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:10.371-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:10 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:10.622-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:10 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:10.872-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:11 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:11.121-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:11 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:11.371-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:11 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:11.621-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:11 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:11.871-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:12 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:12.121-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:12 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:12.371-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:12 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:12.621-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:12 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:12.871-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:13 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:13.121-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:13 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:13.371-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:13 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:13.621-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:13 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:13.871-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:14 FORGE ollama[555324]: time=2024-12-31T01:58:14.121-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.001576273 model=/usr/share/ollama/.ollama/models/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816
Dec 31 01:58:14 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:14.122-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:14 FORGE ollama[555324]: time=2024-12-31T01:58:14.371-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.251697771 model=/usr/share/ollama/.ollama/models/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816
Dec 31 01:58:14 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:14.372-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:14 FORGE ollama[555324]: time=2024-12-31T01:58:14.620-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.500690492 model=/usr/share/ollama/.ollama/models/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816
chris@FORGE:~/ai/aiprojects/OllamaPlayground/WorkingExamples/rag2$ 

Are you able to recreate it?
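
In case it helps with reproducing: my script boils down to roughly the flow below. This is a simplified sketch rather than the exact code (the chunk text and question are placeholders, the real chunking and retrieval are omitted, and it assumes the official `ollama` Python package), but the sequence of calls is the same: embed with nomic-embed-text, then generate with llama3.2.

```python
import ollama

# Placeholder corpus standing in for the real chunked documents.
chunks = ["first chunk of text", "second chunk of text"]

# Embedding the chunks works fine; this is what leaves
# nomic-embed-text loaded on the GPU.
vectors = [
    ollama.embeddings(model="nomic-embed-text", prompt=chunk)["embedding"]
    for chunk in chunks
]

# On retrieval, the question is embedded the same way...
question = "placeholder question about the corpus"
query_vector = ollama.embeddings(model="nomic-embed-text", prompt=question)["embedding"]

# ...and the follow-up generation with llama3.2 is the call that fails with
# "llama runner process has terminated: CUDA error".
response = ollama.chat(
    model="llama3.2:latest",
    messages=[{"role": "user", "content": question}],
)
print(response["message"]["content"])
```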

<!-- gh-comment-id:2566205085 --> @iplayfast commented on GitHub (Dec 31, 2024):

```
Dec 31 01:55:57 FORGE ollama[555324]: runtime/asm_amd64.s:1700 +0x1 fp=0xc00008f7e8 sp=0xc00008f7e0 pc=0x56be7cad5561
Dec 31 01:55:57 FORGE ollama[555324]: created by github.com/ollama/ollama/llama/runner.Execute in goroutine 1
Dec 31 01:55:57 FORGE ollama[555324]: github.com/ollama/ollama/llama/runner/runner.go:984 +0xde5
Dec 31 01:55:57 FORGE ollama[555324]: rax 0x0
Dec 31 01:55:57 FORGE ollama[555324]: rbx 0x1b834e
Dec 31 01:55:57 FORGE ollama[555324]: rcx 0x77f67e49eb1c
Dec 31 01:55:57 FORGE ollama[555324]: rdx 0x6
Dec 31 01:55:57 FORGE ollama[555324]: rdi 0x1b833f
Dec 31 01:55:57 FORGE ollama[555324]: rsi 0x1b834e
Dec 31 01:55:57 FORGE ollama[555324]: rbp 0x77f6365de330
Dec 31 01:55:57 FORGE ollama[555324]: rsp 0x77f6365de2f0
Dec 31 01:55:57 FORGE ollama[555324]: r8 0x0
Dec 31 01:55:57 FORGE ollama[555324]: r9 0x0
Dec 31 01:55:57 FORGE ollama[555324]: r10 0x8
Dec 31 01:55:57 FORGE ollama[555324]: r11 0x246
Dec 31 01:55:57 FORGE ollama[555324]: r12 0x6
Dec 31 01:55:57 FORGE ollama[555324]: r13 0x60
Dec 31 01:55:57 FORGE ollama[555324]: r14 0x16
Dec 31 01:55:57 FORGE ollama[555324]: r15 0x77f61c000c60
Dec 31 01:55:57 FORGE ollama[555324]: rip 0x77f67e49eb1c
Dec 31 01:55:57 FORGE ollama[555324]: rflags 0x246
Dec 31 01:55:57 FORGE ollama[555324]: cs 0x33
Dec 31 01:55:57 FORGE ollama[555324]: fs 0x0
Dec 31 01:55:57 FORGE ollama[555324]: gs 0x0
Dec 31 01:55:57 FORGE ollama[555324]: time=2024-12-31T01:55:57.309-05:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: CUDA error: CUDA-capable device(s) is/are busy or unavailable\n current device: 0, in function ggml_backend_cuda_device_get_memory at llama/ggml-cuda/ggml-cuda.cu:2769\n cudaMemGetInfo(free, total)\nllama/ggml-cuda/ggml-cuda.cu:96: CUDA error"
Dec 31 01:55:57 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:55:57 | 500 | 264.751885ms | 127.0.0.1 | POST "/api/generate"
Dec 31 01:55:57 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:57.309-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:55:57 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:57.560-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:55:57 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:57.811-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:55:58 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:58.060-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:55:58 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:58.311-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:55:58 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:58.561-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:55:58 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:58.811-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:55:59 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:59.061-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:55:59 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:59.311-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:55:59 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:59.560-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:55:59 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:55:59.811-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:00 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:00.060-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:00 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:00.311-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:00 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:00.560-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:00 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:00.811-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:01 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:01.061-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:01 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:01.310-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:01 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:01.560-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:01 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:01.811-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:02 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:02.061-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:02 FORGE ollama[555324]: time=2024-12-31T01:56:02.311-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.001662227 model=/usr/share/ollama/.ollama/models/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816
Dec 31 01:56:02 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:02.311-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:02 FORGE ollama[555324]: time=2024-12-31T01:56:02.560-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.250827495 model=/usr/share/ollama/.ollama/models/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816
Dec 31 01:56:02 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:56:02.561-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:56:02 FORGE ollama[555324]: time=2024-12-31T01:56:02.810-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.500989833 model=/usr/share/ollama/.ollama/models/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816
Dec 31 01:57:45 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:57:45 | 200 | 36.95µs | 127.0.0.1 | HEAD "/"
Dec 31 01:57:45 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:57:45 | 200 | 49.719µs | 127.0.0.1 | GET "/api/ps"
Dec 31 01:57:48 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:57:48 | 200 | 23.165µs | 127.0.0.1 | HEAD "/"
Dec 31 01:57:48 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:57:48 | 200 | 10.945101ms | 127.0.0.1 | POST "/api/show"
Dec 31 01:57:48 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:57:48 | 200 | 13.489708ms | 127.0.0.1 | POST "/api/generate"
Dec 31 01:57:52 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:57:52 | 200 | 336.628424ms | 127.0.0.1 | POST "/api/chat"
Dec 31 01:58:00 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:58:00 | 200 | 25.838µs | 127.0.0.1 | HEAD "/"
Dec 31 01:58:00 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:58:00 | 200 | 14.187616ms | 127.0.0.1 | POST "/api/show"
Dec 31 01:58:00 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:00.345-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:00 FORGE ollama[555324]: time=2024-12-31T01:58:00.362-05:00 level=INFO source=sched.go:507 msg="updated VRAM based on existing loaded models" gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 library=cuda total="23.6 GiB" available="2.1 GiB"
Dec 31 01:58:00 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:00.362-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory"
Dec 31 01:58:00 FORGE ollama[555324]: time=2024-12-31T01:58:00.890-05:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=GPU-ee6fccb6-c2ba-ccdf-f4c0-9f242c374a86 parallel=4 available=23665770496 required="3.7 GiB"
Dec 31 01:58:00 FORGE ollama[555324]: time=2024-12-31T01:58:00.953-05:00 level=INFO source=server.go:104 msg="system memory" total="125.6 GiB" free="100.8 GiB" free_swap="2.0 GiB"
Dec 31 01:58:00 FORGE ollama[555324]: time=2024-12-31T01:58:00.953-05:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[22.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.7 GiB" memory.required.partial="3.7 GiB" memory.required.kv="896.0 MiB" memory.required.allocations="[3.7 GiB]" memory.weights.total="2.4 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="424.0 MiB" memory.graph.partial="570.7 MiB"
Dec 31 01:58:00 FORGE ollama[555324]: time=2024-12-31T01:58:00.953-05:00 level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/local/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 --threads 8 --parallel 4 --port 35619"
Dec 31 01:58:00 FORGE ollama[555324]: time=2024-12-31T01:58:00.954-05:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
Dec 31 01:58:00 FORGE ollama[555324]: time=2024-12-31T01:58:00.954-05:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
```
Dec 31 01:58:09 FORGE ollama[555324]: internal/poll/fd_poll_runtime.go:89 Dec 31 01:58:09 FORGE ollama[555324]: internal/poll.(*FD).Accept(0xc000176100) Dec 31 01:58:09 FORGE ollama[555324]: internal/poll/fd_unix.go:620 +0x295 fp=0xc00002d8d8 sp=0xc00002d830 pc=0x5b99852d8fd5 Dec 31 01:58:09 FORGE ollama[555324]: net.(*netFD).accept(0xc000176100) Dec 31 01:58:09 FORGE ollama[555324]: net/fd_unix.go:172 +0x29 fp=0xc00002d990 sp=0xc00002d8d8 pc=0x5b9985351969 Dec 31 01:58:09 FORGE ollama[555324]: net.(*TCPListener).accept(0xc0001266c0) Dec 31 01:58:09 FORGE ollama[555324]: net/tcpsock_posix.go:159 +0x1e fp=0xc00002d9e0 sp=0xc00002d990 pc=0x5b9985361fbe Dec 31 01:58:09 FORGE ollama[555324]: net.(*TCPListener).Accept(0xc0001266c0) Dec 31 01:58:09 FORGE ollama[555324]: net/tcpsock.go:372 +0x30 fp=0xc00002da10 sp=0xc00002d9e0 pc=0x5b99853612f0 Dec 31 01:58:09 FORGE ollama[555324]: net/http.(*onceCloseListener).Accept(0x5b99858c1d38?) Dec 31 01:58:09 FORGE ollama[555324]: <autogenerated>:1 +0x24 fp=0xc00002da28 sp=0xc00002da10 pc=0x5b998549fec4 Dec 31 01:58:09 FORGE ollama[555324]: net/http.(*Server).Serve(0xc0001744b0, {0x5b99858c17f8, 0xc0001266c0}) Dec 31 01:58:09 FORGE ollama[555324]: net/http/server.go:3330 +0x30c fp=0xc00002db58 sp=0xc00002da28 pc=0x5b9985491c0c Dec 31 01:58:09 FORGE ollama[555324]: github.com/ollama/ollama/llama/runner.Execute({0xc00012a010?, 0x5b998528a1bc?, 0x0?}) Dec 31 01:58:09 FORGE ollama[555324]: github.com/ollama/ollama/llama/runner/runner.go:1005 +0x11a9 fp=0xc00002def8 sp=0xc00002db58 pc=0x5b99854c7309 Dec 31 01:58:09 FORGE ollama[555324]: main.main() Dec 31 01:58:09 FORGE ollama[555324]: github.com/ollama/ollama/cmd/runner/main.go:11 +0x54 fp=0xc00002df50 sp=0xc00002def8 pc=0x5b99854c8294 Dec 31 01:58:09 FORGE ollama[555324]: runtime.main() Dec 31 01:58:09 FORGE ollama[555324]: runtime/proc.go:272 +0x29d fp=0xc00002dfe0 sp=0xc00002df50 pc=0x5b998524ec7d Dec 31 01:58:09 FORGE ollama[555324]: runtime.goexit({}) Dec 31 01:58:09 FORGE ollama[555324]: runtime/asm_amd64.s:1700 +0x1 fp=0xc00002dfe8 sp=0xc00002dfe0 pc=0x5b998528a561 Dec 31 01:58:09 FORGE ollama[555324]: goroutine 2 gp=0xc000006c40 m=nil [force gc (idle)]: Dec 31 01:58:09 FORGE ollama[555324]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) Dec 31 01:58:09 FORGE ollama[555324]: runtime/proc.go:424 +0xce fp=0xc00008cfa8 sp=0xc00008cf88 pc=0x5b998528292e Dec 31 01:58:09 FORGE ollama[555324]: runtime.goparkunlock(...) Dec 31 01:58:09 FORGE ollama[555324]: runtime/proc.go:430 Dec 31 01:58:09 FORGE ollama[555324]: runtime.forcegchelper() Dec 31 01:58:09 FORGE ollama[555324]: runtime/proc.go:337 +0xb8 fp=0xc00008cfe0 sp=0xc00008cfa8 pc=0x5b998524efb8 Dec 31 01:58:09 FORGE ollama[555324]: runtime.goexit({}) Dec 31 01:58:09 FORGE ollama[555324]: runtime/asm_amd64.s:1700 +0x1 fp=0xc00008cfe8 sp=0xc00008cfe0 pc=0x5b998528a561 Dec 31 01:58:09 FORGE ollama[555324]: created by runtime.init.7 in goroutine 1 Dec 31 01:58:09 FORGE ollama[555324]: runtime/proc.go:325 +0x1a Dec 31 01:58:09 FORGE ollama[555324]: goroutine 3 gp=0xc000007180 m=nil [GC sweep wait]: Dec 31 01:58:09 FORGE ollama[555324]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) Dec 31 01:58:09 FORGE ollama[555324]: runtime/proc.go:424 +0xce fp=0xc00008d780 sp=0xc00008d760 pc=0x5b998528292e Dec 31 01:58:09 FORGE ollama[555324]: runtime.goparkunlock(...) 
Dec 31 01:58:09 FORGE ollama[555324]: runtime/proc.go:430 Dec 31 01:58:09 FORGE ollama[555324]: runtime.bgsweep(0xc0000ba000) Dec 31 01:58:09 FORGE ollama[555324]: runtime/mgcsweep.go:277 +0x94 fp=0xc00008d7c8 sp=0xc00008d780 pc=0x5b99852397f4 Dec 31 01:58:09 FORGE ollama[555324]: runtime.gcenable.gowrap1() Dec 31 01:58:09 FORGE ollama[555324]: runtime/mgc.go:204 +0x25 fp=0xc00008d7e0 sp=0xc00008d7c8 pc=0x5b998522e0a5 Dec 31 01:58:09 FORGE ollama[555324]: runtime.goexit({}) Dec 31 01:58:09 FORGE ollama[555324]: runtime/asm_amd64.s:1700 +0x1 fp=0xc00008d7e8 sp=0xc00008d7e0 pc=0x5b998528a561 Dec 31 01:58:09 FORGE ollama[555324]: created by runtime.gcenable in goroutine 1 Dec 31 01:58:09 FORGE ollama[555324]: runtime/mgc.go:204 +0x66 Dec 31 01:58:09 FORGE ollama[555324]: goroutine 4 gp=0xc000007340 m=nil [GC scavenge wait]: Dec 31 01:58:09 FORGE ollama[555324]: runtime.gopark(0xc0000ba000?, 0x5b99857a2e60?, 0x1?, 0x0?, 0xc000007340?) Dec 31 01:58:09 FORGE ollama[555324]: runtime/proc.go:424 +0xce fp=0xc00008df78 sp=0xc00008df58 pc=0x5b998528292e Dec 31 01:58:09 FORGE ollama[555324]: runtime.goparkunlock(...) Dec 31 01:58:09 FORGE ollama[555324]: runtime/proc.go:430 Dec 31 01:58:09 FORGE ollama[555324]: runtime.(*scavengerState).park(0x5b9985aad060) Dec 31 01:58:09 FORGE ollama[555324]: runtime/mgcscavenge.go:425 +0x49 fp=0xc00008dfa8 sp=0xc00008df78 pc=0x5b9985237229 Dec 31 01:58:09 FORGE ollama[555324]: runtime.bgscavenge(0xc0000ba000) Dec 31 01:58:09 FORGE ollama[555324]: runtime/mgcscavenge.go:653 +0x3c fp=0xc00008dfc8 sp=0xc00008dfa8 pc=0x5b998523779c Dec 31 01:58:09 FORGE ollama[555324]: runtime.gcenable.gowrap2() Dec 31 01:58:09 FORGE ollama[555324]: runtime/mgc.go:205 +0x25 fp=0xc00008dfe0 sp=0xc00008dfc8 pc=0x5b998522e045 Dec 31 01:58:09 FORGE ollama[555324]: runtime.goexit({}) Dec 31 01:58:09 FORGE ollama[555324]: runtime/asm_amd64.s:1700 +0x1 fp=0xc00008dfe8 sp=0xc00008dfe0 pc=0x5b998528a561 Dec 31 01:58:09 FORGE ollama[555324]: created by runtime.gcenable in goroutine 1 Dec 31 01:58:09 FORGE ollama[555324]: runtime/mgc.go:205 +0xa5 Dec 31 01:58:09 FORGE ollama[555324]: goroutine 18 gp=0xc000106380 m=nil [finalizer wait]: Dec 31 01:58:09 FORGE ollama[555324]: runtime.gopark(0xc00008c648?, 0x5b99852245a5?, 0xb0?, 0x1?, 0xc0000061c0?) Dec 31 01:58:09 FORGE ollama[555324]: runtime/proc.go:424 +0xce fp=0xc00008c620 sp=0xc00008c600 pc=0x5b998528292e Dec 31 01:58:09 FORGE ollama[555324]: runtime.runfinq() Dec 31 01:58:09 FORGE ollama[555324]: runtime/mfinal.go:193 +0x107 fp=0xc00008c7e0 sp=0xc00008c620 pc=0x5b998522d127 Dec 31 01:58:09 FORGE ollama[555324]: runtime.goexit({}) Dec 31 01:58:09 FORGE ollama[555324]: runtime/asm_amd64.s:1700 +0x1 fp=0xc00008c7e8 sp=0xc00008c7e0 pc=0x5b998528a561 Dec 31 01:58:09 FORGE ollama[555324]: created by runtime.createfing in goroutine 1 Dec 31 01:58:09 FORGE ollama[555324]: runtime/mfinal.go:163 +0x3d Dec 31 01:58:09 FORGE ollama[555324]: goroutine 19 gp=0xc000106540 m=nil [chan receive]: Dec 31 01:58:09 FORGE ollama[555324]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) Dec 31 01:58:09 FORGE ollama[555324]: runtime/proc.go:424 +0xce fp=0xc000088718 sp=0xc0000886f8 pc=0x5b998528292e Dec 31 01:58:09 FORGE ollama[555324]: runtime.chanrecv(0xc0001000e0, 0x0, 0x1) Dec 31 01:58:09 FORGE ollama[555324]: runtime/chan.go:639 +0x41c fp=0xc000088790 sp=0xc000088718 pc=0x5b998521dd7c Dec 31 01:58:09 FORGE ollama[555324]: runtime.chanrecv1(0x0?, 0x0?) 
Dec 31 01:58:09 FORGE ollama[555324]: runtime/chan.go:489 +0x12 fp=0xc0000887b8 sp=0xc000088790 pc=0x5b998521d952 Dec 31 01:58:09 FORGE ollama[555324]: runtime.unique_runtime_registerUniqueMapCleanup.func1(...) Dec 31 01:58:09 FORGE ollama[555324]: runtime/mgc.go:1781 Dec 31 01:58:09 FORGE ollama[555324]: runtime.unique_runtime_registerUniqueMapCleanup.gowrap1() Dec 31 01:58:09 FORGE ollama[555324]: runtime/mgc.go:1784 +0x2f fp=0xc0000887e0 sp=0xc0000887b8 pc=0x5b9985230f0f Dec 31 01:58:09 FORGE ollama[555324]: runtime.goexit({}) Dec 31 01:58:09 FORGE ollama[555324]: runtime/asm_amd64.s:1700 +0x1 fp=0xc0000887e8 sp=0xc0000887e0 pc=0x5b998528a561 Dec 31 01:58:09 FORGE ollama[555324]: created by unique.runtime_registerUniqueMapCleanup in goroutine 1 Dec 31 01:58:09 FORGE ollama[555324]: runtime/mgc.go:1779 +0x96 Dec 31 01:58:09 FORGE ollama[555324]: goroutine 21 gp=0xc0001068c0 m=nil [semacquire]: Dec 31 01:58:09 FORGE ollama[555324]: runtime.gopark(0x0?, 0x0?, 0xc0?, 0xe0?, 0x0?) Dec 31 01:58:09 FORGE ollama[555324]: runtime/proc.go:424 +0xce fp=0xc000089618 sp=0xc0000895f8 pc=0x5b998528292e Dec 31 01:58:09 FORGE ollama[555324]: runtime.goparkunlock(...) Dec 31 01:58:09 FORGE ollama[555324]: runtime/proc.go:430 Dec 31 01:58:09 FORGE ollama[555324]: runtime.semacquire1(0xc00013e1b8, 0x0, 0x1, 0x0, 0x12) Dec 31 01:58:09 FORGE ollama[555324]: runtime/sema.go:178 +0x22c fp=0xc000089680 sp=0xc000089618 pc=0x5b9985261c4c Dec 31 01:58:09 FORGE ollama[555324]: sync.runtime_Semacquire(0x0?) Dec 31 01:58:09 FORGE ollama[555324]: runtime/sema.go:71 +0x25 fp=0xc0000896b8 sp=0xc000089680 pc=0x5b9985283b65 Dec 31 01:58:09 FORGE ollama[555324]: sync.(*WaitGroup).Wait(0x0?) Dec 31 01:58:09 FORGE ollama[555324]: sync/waitgroup.go:118 +0x48 fp=0xc0000896e0 sp=0xc0000896b8 pc=0x5b998529fe08 Dec 31 01:58:09 FORGE ollama[555324]: github.com/ollama/ollama/llama/runner.(*Server).run(0xc00013e1b0, {0x5b99858c1de0, 0xc00017e050}) Dec 31 01:58:09 FORGE ollama[555324]: github.com/ollama/ollama/llama/runner/runner.go:315 +0x47 fp=0xc0000897b8 sp=0xc0000896e0 pc=0x5b99854c2487 Dec 31 01:58:09 FORGE ollama[555324]: github.com/ollama/ollama/llama/runner.Execute.gowrap2() Dec 31 01:58:09 FORGE ollama[555324]: github.com/ollama/ollama/llama/runner/runner.go:984 +0x28 fp=0xc0000897e0 sp=0xc0000897b8 pc=0x5b99854c7628 Dec 31 01:58:09 FORGE ollama[555324]: runtime.goexit({}) Dec 31 01:58:09 FORGE ollama[555324]: runtime/asm_amd64.s:1700 +0x1 fp=0xc0000897e8 sp=0xc0000897e0 pc=0x5b998528a561 Dec 31 01:58:09 FORGE ollama[555324]: created by github.com/ollama/ollama/llama/runner.Execute in goroutine 1 Dec 31 01:58:09 FORGE ollama[555324]: github.com/ollama/ollama/llama/runner/runner.go:984 +0xde5 Dec 31 01:58:09 FORGE ollama[555324]: rax 0x0 Dec 31 01:58:09 FORGE ollama[555324]: rbx 0x1b9223 Dec 31 01:58:09 FORGE ollama[555324]: rcx 0x7bcd42e9eb1c Dec 31 01:58:09 FORGE ollama[555324]: rdx 0x6 Dec 31 01:58:09 FORGE ollama[555324]: rdi 0x1b9221 Dec 31 01:58:09 FORGE ollama[555324]: rsi 0x1b9223 Dec 31 01:58:09 FORGE ollama[555324]: rbp 0x7bccfb9de330 Dec 31 01:58:09 FORGE ollama[555324]: rsp 0x7bccfb9de2f0 Dec 31 01:58:09 FORGE ollama[555324]: r8 0x0 Dec 31 01:58:09 FORGE ollama[555324]: r9 0x0 Dec 31 01:58:09 FORGE ollama[555324]: r10 0x8 Dec 31 01:58:09 FORGE ollama[555324]: r11 0x246 Dec 31 01:58:09 FORGE ollama[555324]: r12 0x6 Dec 31 01:58:09 FORGE ollama[555324]: r13 0x60 Dec 31 01:58:09 FORGE ollama[555324]: r14 0x16 Dec 31 01:58:09 FORGE ollama[555324]: r15 0x7bcce4000c60 Dec 31 01:58:09 FORGE ollama[555324]: rip 
0x7bcd42e9eb1c Dec 31 01:58:09 FORGE ollama[555324]: rflags 0x246 Dec 31 01:58:09 FORGE ollama[555324]: cs 0x33 Dec 31 01:58:09 FORGE ollama[555324]: fs 0x0 Dec 31 01:58:09 FORGE ollama[555324]: gs 0x0 Dec 31 01:58:09 FORGE ollama[555324]: time=2024-12-31T01:58:09.119-05:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: CUDA error: CUDA-capable device(s) is/are busy or unavailable\n current device: 0, in function ggml_backend_cuda_device_get_memory at llama/ggml-cuda/ggml-cuda.cu:2769\n cudaMemGetInfo(free, total)\nllama/ggml-cuda/ggml-cuda.cu:96: CUDA error" Dec 31 01:58:09 FORGE ollama[555324]: [GIN] 2024/12/31 - 01:58:09 | 500 | 274.433471ms | 127.0.0.1 | POST "/api/generate" Dec 31 01:58:09 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:09.120-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:09 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:09.372-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:09 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:09.621-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:09 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:09.872-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:10 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:10.122-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:10 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:10.371-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:10 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:10.622-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:10 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:10.872-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:11 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:11.121-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:11 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:11.371-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:11 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:11.621-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:11 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:11.871-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:12 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:12.121-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:12 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:12.371-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:12 FORGE ollama[555324]: cuda driver library failed to get device context 
46time=2024-12-31T01:58:12.621-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:12 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:12.871-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:13 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:13.121-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:13 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:13.371-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:13 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:13.621-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:13 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:13.871-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:14 FORGE ollama[555324]: time=2024-12-31T01:58:14.121-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.001576273 model=/usr/share/ollama/.ollama/models/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816 Dec 31 01:58:14 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:14.122-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:14 FORGE ollama[555324]: time=2024-12-31T01:58:14.371-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.251697771 model=/usr/share/ollama/.ollama/models/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816 Dec 31 01:58:14 FORGE ollama[555324]: cuda driver library failed to get device context 46time=2024-12-31T01:58:14.372-05:00 level=WARN source=gpu.go:449 msg="error looking up nvidia GPU memory" Dec 31 01:58:14 FORGE ollama[555324]: time=2024-12-31T01:58:14.620-05:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.500690492 model=/usr/share/ollama/.ollama/models/blobs/sha256-2af3b81862c6be03c769683af18efdadb2c33f60ff32ab6f83e42c043d6c7816 chris@FORGE:~/ai/aiprojects/OllamaPlayground/WorkingExamples/rag2$ ``` are you able to recreate it?

@rick-github commented on GitHub (Dec 31, 2024):

Unable to recreate. Some time earlier than Dec 31 01:21:27 in the full log is an event that causes subsequent accesses to the GPU to fail with status [code 46](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#:~:text=cudaErrorDevicesUnavailable%20%3D%2046) (`cudaErrorDevicesUnavailable`). Since the total VRAM required for all three models is > 24G, I suspect that model swapping somehow caused the GPU to get wedged, and that this is preventing other models from accessing it. If we could see the full log there might be some clue. You can try reloading the device driver to see if that clears the block:

```console
sudo systemctl stop ollama
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
sudo systemctl start ollama
```

Alternatively rebooting may resolve the problem.
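If `rmmod` refuses to unload the module, something still holds a reference to it. A quick way to check (a sketch; assumes `lsmod` and `fuser` are available, as on most distros):

```console
# a non-zero "Used by" count means rmmod nvidia_uvm will fail
lsmod | grep nvidia_uvm
# list any processes still holding the NVIDIA device nodes open
sudo fuser -v /dev/nvidia*
```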

@iplayfast commented on GitHub (Dec 31, 2024):

That made no difference, I can consistently break it. Here are the logs for all of today, which will probably include last night's as well.
Good luck!
[logs.txt.zip](https://github.com/user-attachments/files/18282606/logs.txt.zip)

@rick-github commented on GitHub (Dec 31, 2024):

Which part made no difference, `rmmod` and/or rebooting?

@iplayfast commented on GitHub (Dec 31, 2024):

Back after a reboot. It's now working and I can't seem to break it. Even my RAG example is working now. So this is an error that may or may not happen. :/

It seems that `ollama ps` shows models are not being kept in memory for the stated "4 minutes from now"; they disappear and reappear as needed or when other models are loaded. I had thought they were supposed to stay in memory for the 4 minutes, with other models loading alongside. After loading 6 or so models, I was eventually able to keep 3 in memory at one time.
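For reference, the keep-alive window is configurable: a request can pin a model with the `keep_alive` field, and a server-wide default can be set via the `OLLAMA_KEEP_ALIVE` environment variable. A sketch (the model name is just an example):

```console
# keep llama3.2 loaded indefinitely (-1); a value of 0 would unload it immediately
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "keep_alive": -1}'
```

Note that pinning does not prevent eviction when a newly requested model needs the VRAM.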

Anyways here is the same log with the additional testing I did.

journalctl -u ollama --no-pager -S 2024-12-31 >log1.txt

log1.txt.zip

@iplayfast commented on GitHub (Dec 31, 2024):

`rmmod` made no difference; rebooting did.

@rick-github commented on GitHub (Jan 1, 2025):

Incomplete log. At this point it looks like a random error wedged the GPU. The `rmmod` may have failed because the module was still in use; if there were any error messages, those would help.

The models are not being kept in memory because the total VRAM required for all three models is > 24G, which is what your 4090 has available. If there's not enough VRAM to hold a new model, ollama will evict a loaded one to make room. Tweaking the size of the context window, using flash attention (`OLLAMA_FLASH_ATTENTION`), or reducing the number of concurrent completions (`OLLAMA_NUM_PARALLEL`) will make it easier to have multiple co-resident models; see the sketch below.
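One way to set these on a systemd install (a sketch; the values are illustrative, tune them to your workload):

```console
sudo systemctl edit ollama
# in the override file that opens, add:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
#   Environment="OLLAMA_NUM_PARALLEL=1"
sudo systemctl restart ollama
```

With `OLLAMA_NUM_PARALLEL=1`, each model only needs KV cache for one sequence instead of four (your log shows `parallel=4` and an 8192-token context split into 4 × 2048), which shrinks the per-model footprint considerably.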

@iplayfast commented on GitHub (Jan 1, 2025):

I think I know what's going on. I just got back and my computer had gone to sleep (suspended). I woke it up and things seemed fine, but then this error occurred again.
Looking on the web, this is a known CUDA problem with suspend, so I think there is not much ollama can do about it. Close the issue if you agree.

<!-- gh-comment-id:2566881643 --> @iplayfast commented on GitHub (Jan 1, 2025): I think I know what's going on. Just got back and my computer had gone to sleep (suspended). I woke it up and things seemed fine, but then this error occurred again. Looking on the web, this is a known cuda problem with suspend. So I think there is not much to do with ollama on this. Close the issue if you agree.
@rick-github commented on GitHub (Jan 1, 2025):

https://github.com/ollama/ollama/blob/main/docs/gpu.md#laptop-suspend-resume
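The linked doc recommends reloading the NVIDIA UVM module after suspend/resume (`sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm`). If this bites regularly, it could be automated with a systemd sleep hook; a hypothetical example (the path and filename are illustrative, and the script must be executable):

```console
# /usr/lib/systemd/system-sleep/nvidia-uvm-reload.sh
#!/bin/sh
# systemd passes "pre" before sleep and "post" after resume as $1
if [ "$1" = "post" ]; then
    systemctl stop ollama
    rmmod nvidia_uvm && modprobe nvidia_uvm
    systemctl start ollama
fi
```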