[GH-ISSUE #6448] snowflake-arctic-embed:22m model causes an error on loading #4056

Closed
opened 2026-04-12 14:57:04 -05:00 by GiteaMirror · 41 comments

Originally created by @Abdulrahman392011 on GitHub (Aug 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6448

### What is the issue?

llama runner process has terminated: signal: segmentation fault (core dumped)

This is the error I get every time I try to load that particular model.

All other models work fine, including other embedding models.

### OS

Linux

### GPU

_No response_

### CPU

Intel

### Ollama version

0.3.6

GiteaMirror added the bug label 2026-04-12 14:57:04 -05:00

@rick-github commented on GitHub (Aug 20, 2024):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will help in debugging.


@Abdulrahman392011 commented on GitHub (Aug 20, 2024):

[server logs.txt](https://github.com/user-attachments/files/16681377/server.logs.txt)


@Abdulrahman392011 commented on GitHub (Aug 20, 2024):

Is this an isolated incident? Does this model work on your machine?

This model is 50 megabytes and is the only small English-based model offered by Ollama. I need it because it is better suited to running on a CPU.


@rick-github commented on GitHub (Aug 20, 2024):

It works for me.

```
$ curl -s localhost:11434/api/embed -d '{"model":"snowflake-arctic-embed:22m","input":"Your text string goes here"}' | jq 'del(.embeddings[0][])'
{
  "model": "snowflake-arctic-embed:22m",
  "embeddings": [
    []
  ],
  "total_duration": 1284214993,
  "load_duration": 1184824949,
  "prompt_eval_count": 7
}
```
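
A quick sanity check, as a sketch, is to print only the dimensionality of the returned vector instead of deleting it:

```
# print the length of the first embedding vector rather than its contents
curl -s localhost:11434/api/embed \
  -d '{"model":"snowflake-arctic-embed:22m","input":"Your text string goes here"}' \
  | jq '.embeddings[0] | length'
```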

There's nothing in the logs that indicates why the runner crashed. Please add `OLLAMA_DEBUG=1` to the server environment and try another query; more details may reveal what is going on.


@Abdulrahman392011 commented on GitHub (Aug 20, 2024):

I ran this command:

```
curl http://localhost:11434/api/embed -d '{"model": "snowflake-arctic-embed:22m", "keep_alive": -1}' OLLAMA_DEBUG=1
```

Here is the response:

```
{"error":"llama runner process has terminated: signal: segmentation fault (core dumped)"}curl: (3) URL rejected: Bad hostname
```


@Abdulrahman392011 commented on GitHub (Aug 20, 2024):

I have [internet in a box] installed, but I don't understand why all other models work except this one. I even thought it was because of the size of the model, and I kept searching for another small model that works until I got one in Chinese working.


@rick-github commented on GitHub (Aug 20, 2024):

You need to add the debug flag to `/etc/systemd/system/ollama.service`. Edit that file and, in the `[Service]` section, add

```
Environment="OLLAMA_DEBUG=1"
```

Then restart the service:

```
sudo service ollama restart
```
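
An equivalent approach, as a sketch, is a systemd drop-in override, which also survives reinstalls that rewrite the unit file:

```
sudo systemctl edit ollama     # opens an override file; add the two lines below to it
# [Service]
# Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama
```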

@Abdulrahman392011 commented on GitHub (Aug 20, 2024):

I ran:

```
Environment="OLLAMA_DEBUG=1"
```

and:

```
sudo service ollama restart
```

then again:

```
curl http://localhost:11434/api/embed -d '{"model": "snowflake-arctic-embed:22m", "keep_alive": -1}'
```

and got the response:

```
{"error":"llama runner process has terminated: signal: segmentation fault (core dumped)"}
```

Sorry if I don't get it right away; baby steps.
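
Worth noting: after editing a unit file, systemd has to re-read it before a restart picks the change up, and you can confirm the variable is live. A sketch:

```
sudo systemctl daemon-reload                   # re-read the edited unit file
sudo systemctl restart ollama
systemctl show ollama --property=Environment   # should now include OLLAMA_DEBUG=1
```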


@rick-github commented on GitHub (Aug 20, 2024):

What do the logs show now?


@Abdulrahman392011 commented on GitHub (Aug 20, 2024):

[server logs2.txt](https://github.com/user-attachments/files/16681827/server.logs2.txt)


@Abdulrahman392011 commented on GitHub (Aug 20, 2024):

For the record, this happens with most of the small models I tried, except for this one:

`shaw/dmeta-embedding-zh-q4`

I thought maybe this one works because it's quantized, but then I remembered that the large embedding models also run at fp16 and they work fine, so that can't be the reason. The only thing the models that didn't work have in common is that they are all less than 100 megabytes. I tried about 5 small models; only the one mentioned above worked, the rest gave the same error.


@rick-github commented on GitHub (Aug 20, 2024):

server_logs2.txt doesn't have `OLLAMA_DEBUG=1`. What's the contents of `/etc/systemd/system/ollama.service`?


@Abdulrahman392011 commented on GitHub (Aug 20, 2024):

```
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin"

[Install]
WantedBy=default.target
```


@Abdulrahman392011 commented on GitHub (Aug 20, 2024):

It's worth mentioning that I have an Nvidia card in the laptop, but its compute capability is 3.5 and Ollama requires at least 3.7, so it runs the model on the CPU.
Despite this, I remember that when I installed Ollama it told me an Nvidia card was available. Could it be that the models are being loaded on the GPU and then failing afterwards? Is there a way I can force the model to be loaded on the CPU?
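
For reference, a request can also ask the server to offload zero layers, which pins the model to the CPU; a sketch, assuming the standard `options` field of the embed API:

```
# num_gpu is the number of layers to offload to the GPU; 0 forces CPU-only
curl http://localhost:11434/api/embed -d '{
  "model": "snowflake-arctic-embed:22m",
  "input": "test",
  "options": { "num_gpu": 0 }
}'
```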


@rick-github commented on GitHub (Aug 20, 2024):

You need to add `OLLAMA_DEBUG=1` as described in https://github.com/ollama/ollama/issues/6448#issuecomment-2299737726
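
Concretely, with the flag added, the `[Service]` section of the unit file quoted above would read:

```
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_DEBUG=1"
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin"
```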


@Abdulrahman392011 commented on GitHub (Aug 20, 2024):

My bad, I will do it. I missed the text and just ran it as a command.


@Abdulrahman392011 commented on GitHub (Aug 20, 2024):

[server logs3.txt](https://github.com/user-attachments/files/16682104/server.logs3.txt)


@rick-github commented on GitHub (Aug 20, 2024):

ollama is only using cpu:

```
Aug 20 17:30:19 box ollama[11555]: time=2024-08-20T17:30:19.248-04:00 level=INFO source=gpu.go:265 msg="[0] CUDA GPU is too old. Compute Capability detected: 3.5"
Aug 20 17:30:19 box ollama[11555]: time=2024-08-20T17:30:19.249-04:00 level=INFO source=gpu.go:350 msg="no compatible GPUs were discovered"
Aug 20 17:30:19 box ollama[11555]: time=2024-08-20T17:30:19.249-04:00 level=INFO source=types.go:105 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="7.7 GiB" available="5.5 GiB"
Aug 20 17:30:38 box ollama[11555]: time=2024-08-20T17:30:38.131-04:00 level=DEBUG source=sched.go:206 msg="cpu mode with first model, loading"
```

There's enough memory to load the model:

```
Aug 20 17:30:38 box ollama[11555]: time=2024-08-20T17:30:38.131-04:00 level=INFO source=memory.go:309 msg="offload to cpu" layers.requested=-1 layers.model=7 layers.offload=0 layers.split="" memory.available="[5.5 GiB]" memory.required.full="65.1 MiB" memory.required.partial="0 B" memory.required.kv="6.0 MiB" memory.required.allocations="[65.1 MiB]" memory.weights.total="26.4 MiB" memory.weights.repeating="4.0 MiB" memory.weights.nonrepeating="22.4 MiB" memory.graph.full="12.0 MiB" memory.graph.partial="12.0 MiB"
```

The runner starts:

```
Aug 20 17:30:38 box ollama[11555]: time=2024-08-20T17:30:38.132-04:00 level=INFO source=server.go:393 msg="starting llama server" cmd="/tmp/ollama3718362331/runners/cpu_avx2/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-a83b0493f894772b41e5cfe3e6effc288a94d300cfbbeb900891f8377f358bbf --ctx-size 8192 --batch-size 512 --embedding --log-disable --verbose --no-mmap --parallel 4 --port 41107"
```

Unfortunately there's no indication of anything going wrong until the crash is detected:

```
Aug 20 17:30:38 box ollama[11555]: time=2024-08-20T17:30:38.643-04:00 level=ERROR source=sched.go:451 msg="error loading llama server" error="llama runner process has terminated: signal: segmentation fault (core dumped)"
```

Is there anything in `/var/log/syslog` that might indicate a problem (kernel panic, etc.)? What's the output of `dmesg`?
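
A sketch for pulling out just the relevant lines:

```
# kernel-level records of the runner crash, with human-readable timestamps
sudo dmesg -T | grep -i segfault
# matching entries in syslog, if anything was written there
grep -i ollama_llama /var/log/syslog | tail -n 50
```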


@Abdulrahman392011 commented on GitHub (Aug 21, 2024):

[syslog.zip](https://github.com/user-attachments/files/16687266/syslog.zip)

[dmesg.txt](https://github.com/user-attachments/files/16687306/dmesg.txt)


@Abdulrahman392011 commented on GitHub (Aug 21, 2024):

I tried installing the Ubuntu 24 server version and then installing ubuntu-desktop, installed Ollama, and pulled that specific model: I got the same error. I didn't have [internet in a box] installed on the Ubuntu 24 server; it was a fresh install. I wanted to rule out the stuff I have installed, since I thought it could be the cause of the bug, but no.

Also, have you tried running that model on the CPU on your machine? It could be that the issue doesn't arise if the model is loaded on the GPU. Just hopelessly guessing, haha.


@Abdulrahman392011 commented on GitHub (Aug 21, 2024):

I need to take a break; I have been at this all night. Also, why is it that Ollama doesn't offer any other small English models? It seems kind of weird that this is the only one. Maybe there are other small English models and I just didn't find them. Anyway, if you find anything in the syslog or the dmesg output, message me.

Thanks for your time.


@Abdulrahman392011 commented on GitHub (Aug 21, 2024):

I needed to try one final thing. I downloaded the large default model for snowflake-arctic-embed and it gives the same error as the 22m and 33m, so the issue isn't the size of the models; this specific model just doesn't work here in general.

You have it working on your machine, but maybe you had it installed in the past. Try removing the 22m model and re-pulling it to see whether the issue might be a broken link, a bad model file, or something related to the model repository.


@rick-github commented on GitHub (Aug 21, 2024):

I downloaded 22m specifically to test. I am using CPU only in the tests.

Your logs show repeated kernel failures for the nvidia_uvm kernel module. These are described as "harmless" by Nvidia, so they may not be the cause of your problem. The runner doesn't use the GPU, but ollama does probe for GPUs, which is what triggers these.

The kernel logs the segfaults from the runner:

```
2024-08-21T02:53:20.319877-04:00 box kernel: ollama_llama_se[4991]: segfault at 75b136bffd20 ip 00000000004f4e28 sp 00007ffeda157518 error 4 in ollama_llama_server[40b000+2be000] likely on CPU 1 (core 1, socket 0)
2024-08-21T02:53:20.319884-04:00 box kernel: Code: f7 ff 48 89 ef e8 98 75 f1 ff 0f 1f 84 00 00 00 00 00 48 89 d1 48 85 d2 7e 28 49 c7 c0 c0 30 76 00 31 c0 0f 1f 80 00 00 00 00 <0f> b7 14 47 c4 c1 7a 10 04 90 c5 fa 11 04 86 48 83 c0 01 48 39 c1
```

This is [ggml_fp16_to_fp32_row](https://github.com/ggerganov/llama.cpp/blob/fc54ef0d1c138133a01933296d50a36a1ab64735/ggml/src/ggml.c#L436):

```
00000000004f4e10 <ggml_fp16_to_fp32_row>:
  4f4e10:       48 89 d1                mov    %rdx,%rcx
  4f4e13:       48 85 d2                test   %rdx,%rdx
  4f4e16:       7e 28                   jle    4f4e40 <ggml_fp16_to_fp32_row+0x30>
  4f4e18:       49 c7 c0 c0 30 76 00    mov    $0x7630c0,%r8
  4f4e1f:       31 c0                   xor    %eax,%eax
  4f4e21:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
 *4f4e28:       0f b7 14 47             movzwl (%rdi,%rax,2),%edx
  4f4e2c:       c4 c1 7a 10 04 90       vmovss (%r8,%rdx,4),%xmm0
  4f4e32:       c5 fa 11 04 86          vmovss %xmm0,(%rsi,%rax,4)
  4f4e37:       48 83 c0 01             add    $0x1,%rax
  4f4e3b:       48 39 c1                cmp    %rax,%rcx
  4f4e3e:       75 e8                   jne    4f4e28 <ggml_fp16_to_fp32_row+0x18>
  4f4e40:       c3                      ret
```

which implies that the `x`, `y` or `n` arguments to the function are invalid. Without a stack frame or register dump it's hard to debug further.
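
If systemd-coredump happens to be active on the machine, the registers may already have been captured; a sketch:

```
coredumpctl list   # recent crashes, if systemd-coredump is collecting cores
coredumpctl gdb    # open the newest core in gdb, then run "bt" and "info registers"
```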

The question is why ollama on your system is unable to load snowflake-arctic-embed models when nomic-embed-text works fine and snowflake-arctic-embed works for others.

This is a bit of a reach, but one thing you can try is using one of the other CPU-based runners. Set `OLLAMA_LLM_LIBRARY=cpu` in your server environment, restart ollama, and try a query. This will use the most basic runner and will run a bit slower. If that works, try `OLLAMA_LLM_LIBRARY=cpu_avx`.

You didn't mention it above, but [all-minilm](https://ollama.com/library/all-minilm/tags) is a small model that might suit your needs. Alternatively, huggingface has [text classification models](https://huggingface.co/models?pipeline_tag=text-classification&sort=trending) that you could convert for use with ollama.
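
As a quick sketch of trying the suggestion:

```
ollama pull all-minilm
curl -s localhost:11434/api/embed -d '{"model":"all-minilm","input":"hello"}' \
  | jq '.embeddings[0] | length'
```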


@rick-github commented on GitHub (Aug 21, 2024):

If you have `gdb` installed on your machine, you could add some extra debugging information by running this command:

```bash
gdb -ex="set confirm off" -ex="set pagination off" -ex=r -ex=br -ex=disassemble -ex="i r" -ex=q --args \
  $(ls /tmp/ollama*/runners/cpu_avx2/ollama_llama_server | tail -1) \
  --model /usr/share/ollama/.ollama/models/blobs/sha256-a83b0493f894772b41e5cfe3e6effc288a94d300cfbbeb900891f8377f358bbf \
  --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --parallel 4 --port 33353
```

@Abdulrahman392011 commented on GitHub (Aug 22, 2024):

How about this: I run nomic-embed-text and then run snowflake-arctic-embed, then send you the logs and you compare the two and see what the difference is between when it runs successfully (nomic) and when it fails (snowflake).

I will try to install gdb to give a little bit more insight.


@Abdulrahman392011 commented on GitHub (Aug 22, 2024):

[gdb_output.txt](https://github.com/user-attachments/files/16702994/gdb_output.txt)


@Abdulrahman392011 commented on GitHub (Aug 22, 2024):

[gdb_output_after (nomic).txt](https://github.com/user-attachments/files/16703021/gdb_output_after.nomic.txt)


@Abdulrahman392011 commented on GitHub (Aug 22, 2024):

[gdb_output_after(snowflake).txt](https://github.com/user-attachments/files/16703031/gdb_output_after.snowflake.txt)


@Abdulrahman392011 commented on GitHub (Aug 22, 2024):

They look identical. Should I run the model-loading command within the gdb command?


@Abdulrahman392011 commented on GitHub (Aug 22, 2024):

By the way, I just tried the all-minilm you mentioned above and it works, so that solves my issue.

As for the rest, if you're interested in continuing to chase the bug out of curiosity, we can continue. Just know that my issue has been solved, thanks to you.


@rick-github commented on GitHub (Aug 22, 2024):

I'm glad it's resolved. The command I gave earlier has a mistake in it (`-ex=br` sets a breakpoint where `-ex=bt`, a backtrace, was intended); if you could run the corrected one below, it will give me a better idea of where the error is happening. But if you are happy with all-minilm, we can close this and use it as a reference later if somebody else has a similar problem.

```bash
gdb -ex="set confirm off" -ex="set pagination off" -ex=r -ex=bt -ex=disassemble -ex="i r" -ex=q --args \
  $(ls /tmp/ollama*/runners/cpu_avx2/ollama_llama_server | tail -1) \
  --model /usr/share/ollama/.ollama/models/blobs/sha256-a83b0493f894772b41e5cfe3e6effc288a94d300cfbbeb900891f8377f358bbf \
  --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --parallel 4 --port 33353
```

@Abdulrahman392011 commented on GitHub (Aug 22, 2024):

[gdb_after(nomic).txt](https://github.com/user-attachments/files/16707013/gdb_after.nomic.txt)

[gdb_after(snowflake).txt](https://github.com/user-attachments/files/16707014/gdb_after.snowflake.txt)


@Abdulrahman392011 commented on GitHub (Aug 22, 2024):

Something funny about old laptops in general:
I had a 2016 MacBook Air, and one day I got bored and decided to open it up. Long story short, I severed the display flex cable and the USB flex cable.
I went to Apple and they literally said that 2016 is vintage and they no longer make replacement parts for it. They said "vintage" like I should go put it in a museum, haha.

Moral of the story: my 2015 Dell laptop is unmaintained, and old machines be buggy.


@Milor123 commented on GitHub (Aug 22, 2024):

@rick-github I have the same or a similar issue using nomic-embed-text. I am using llama3.1, and ollama is running locally on my Manjaro machine while I try to add an embedding using AnythingLLM.

```
OLLAMA_DEBUG=1 ollama serve
2024/08/22 17:05:13 routes.go:1108: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/noe/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-08-22T17:05:13.276-05:00 level=INFO source=images.go:781 msg="total blobs: 72"
time=2024-08-22T17:05:13.277-05:00 level=INFO source=images.go:788 msg="total unused blobs removed: 0"
time=2024-08-22T17:05:13.278-05:00 level=INFO source=routes.go:1155 msg="Listening on 127.0.0.1:11434 (version 0.3.3)"
time=2024-08-22T17:05:13.278-05:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1214628755/runners
time=2024-08-22T17:05:13.278-05:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz
time=2024-08-22T17:05:13.278-05:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz
time=2024-08-22T17:05:13.278-05:00 level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz
time=2024-08-22T17:05:13.278-05:00 level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v12 file=build/linux/x86_64/cuda_v12/bin/libcublas.so.12.gz
time=2024-08-22T17:05:13.278-05:00 level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v12 file=build/linux/x86_64/cuda_v12/bin/libcublasLt.so.12.gz
time=2024-08-22T17:05:13.278-05:00 level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v12 file=build/linux/x86_64/cuda_v12/bin/libcudart.so.12.gz
time=2024-08-22T17:05:13.278-05:00 level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v12 file=build/linux/x86_64/cuda_v12/bin/ollama_llama_server.gz
time=2024-08-22T17:05:15.914-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu/ollama_llama_server
time=2024-08-22T17:05:15.914-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu_avx/ollama_llama_server
time=2024-08-22T17:05:15.914-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu_avx2/ollama_llama_server
time=2024-08-22T17:05:15.914-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cuda_v12/ollama_llama_server
time=2024-08-22T17:05:15.914-05:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cuda_v12 cpu]"
time=2024-08-22T17:05:15.914-05:00 level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-08-22T17:05:15.914-05:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2024-08-22T17:05:15.914-05:00 level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
time=2024-08-22T17:05:15.914-05:00 level=DEBUG source=gpu.go:91 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-08-22T17:05:15.914-05:00 level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=libcuda.so*
time=2024-08-22T17:05:15.914-05:00 level=DEBUG source=gpu.go:487 msg="gpu library search" globs="[/home/noe/clonados/chroma/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-08-22T17:05:15.939-05:00 level=DEBUG source=gpu.go:521 msg="discovered GPU libraries" paths="[/usr/lib/libcuda.so.550.107.02 /usr/lib32/libcuda.so.550.107.02 /usr/lib64/libcuda.so.550.107.02]"
CUDA driver version: 12.4
time=2024-08-22T17:05:15.953-05:00 level=DEBUG source=gpu.go:124 msg="detected GPUs" count=1 library=/usr/lib/libcuda.so.550.107.02
[GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx] CUDA totalMem 11787 mb
[GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx] CUDA freeMem 11035 mb
[GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx] Compute Capability 8.9
time=2024-08-22T17:05:16.077-05:00 level=DEBUG source=amd_linux.go:371 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cuda driver library
time=2024-08-22T17:05:16.077-05:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx library=cuda compute=8.9 driver=12.4 name="NVIDIA GeForce RTX 4070" total="11.5 GiB" available="10.8 GiB"
[GIN] 2024/08/22 - 17:05:31 | 200 |      28.261µs |       127.0.0.1 | HEAD     "/"
time=2024-08-22T17:05:31.854-05:00 level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="46.4 GiB" before.free="23.2 GiB" before.free_swap="94.0 GiB" now.total="46.4 GiB" now.free="22.9 GiB" now.free_swap="94.0 GiB"
CUDA driver version: 12.4
time=2024-08-22T17:05:32.003-05:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.5 GiB" before.free="10.8 GiB" now.total="11.5 GiB" now.free="10.8 GiB" now.used="752.8 MiB"
releasing cuda driver library
time=2024-08-22T17:05:32.003-05:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x556bf34cad80 gpu_count=1
time=2024-08-22T17:05:32.005-05:00 level=DEBUG source=sched.go:219 msg="loading first model" model=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:32.005-05:00 level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[10.8 GiB]"
time=2024-08-22T17:05:32.006-05:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 gpu=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx parallel=4 available=11571036160 required="1.0 GiB"
time=2024-08-22T17:05:32.006-05:00 level=DEBUG source=server.go:100 msg="system memory" total="46.4 GiB" free="22.9 GiB" free_swap="94.0 GiB"
time=2024-08-22T17:05:32.006-05:00 level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[10.8 GiB]"
time=2024-08-22T17:05:32.006-05:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=13 layers.offload=13 layers.split="" memory.available="[10.8 GiB]" memory.required.full="1.0 GiB" memory.required.partial="1.0 GiB" memory.required.kv="96.0 MiB" memory.required.allocations="[1.0 GiB]" memory.weights.total="312.1 MiB" memory.weights.repeating="267.4 MiB" memory.weights.nonrepeating="44.7 MiB" memory.graph.full="192.0 MiB" memory.graph.partial="192.0 MiB"
time=2024-08-22T17:05:32.006-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu/ollama_llama_server
time=2024-08-22T17:05:32.006-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu_avx/ollama_llama_server
time=2024-08-22T17:05:32.006-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu_avx2/ollama_llama_server
time=2024-08-22T17:05:32.006-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cuda_v12/ollama_llama_server
time=2024-08-22T17:05:32.006-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu/ollama_llama_server
time=2024-08-22T17:05:32.006-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu_avx/ollama_llama_server
time=2024-08-22T17:05:32.006-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu_avx2/ollama_llama_server
time=2024-08-22T17:05:32.006-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cuda_v12/ollama_llama_server
time=2024-08-22T17:05:32.007-05:00 level=INFO source=server.go:384 msg="starting llama server" cmd="/tmp/ollama1214628755/runners/cuda_v12/ollama_llama_server --model /home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 --ctx-size 32768 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --verbose --parallel 4 --port 46199"
time=2024-08-22T17:05:32.007-05:00 level=DEBUG source=server.go:401 msg=subprocess environment="[CUDA_PATH=/opt/cuda PATH=/home/noe/.bun/bin:/usr/lib/ccache/bin:/opt/miniconda3/condabin:/home/noe/.pyenv/plugins/pyenv-virtualenv/shims:/home/noe/.pyenv/shims:/home/noe/.nvm/versions/node/v20.11.1/bin:/home/noe/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/cuda/bin:/opt/cuda/nsight_compute:/opt/cuda/nsight_systems/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/usr/lib/rustup/bin:/home/noe/.cargo/bin LD_LIBRARY_PATH=/tmp/ollama1214628755/runners/cuda_v12:/tmp/ollama1214628755/runners CUDA_VISIBLE_DEVICES=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx]"
time=2024-08-22T17:05:32.007-05:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-22T17:05:32.007-05:00 level=DEBUG source=sched.go:571 msg="evaluating already loaded" model=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:32.007-05:00 level=INFO source=server.go:584 msg="waiting for llama runner to start responding"
time=2024-08-22T17:05:32.008-05:00 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3485 commit="6eeaeba12" tid="139638475837440" timestamp=1724364332
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139638475837440" timestamp=1724364332 total_threads=20
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="46199" tid="139638475837440" timestamp=1724364332
llama_model_loader: loaded meta data with 24 key-value pairs and 112 tensors from /home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
llama_model_loader: - kv   1:                               general.name str              = nomic-embed-text-v1.5
llama_model_loader: - kv   2:                     nomic-bert.block_count u32              = 12
llama_model_loader: - kv   3:                  nomic-bert.context_length u32              = 2048
llama_model_loader: - kv   4:                nomic-bert.embedding_length u32              = 768
llama_model_loader: - kv   5:             nomic-bert.feed_forward_length u32              = 3072
llama_model_loader: - kv   6:            nomic-bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:    nomic-bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                nomic-bert.attention.causal bool             = false
llama_model_loader: - kv  10:                    nomic-bert.pooling_type u32              = 1
llama_model_loader: - kv  11:                  nomic-bert.rope.freq_base f32              = 1000.000000
llama_model_loader: - kv  12:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  20:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  22:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  23:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - type  f32:   51 tensors
llama_model_loader: - type  f16:   61 tensors
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.2032 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = nomic-bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 30522
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 768
llm_load_print_meta: n_layer          = 12
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 768
llm_load_print_meta: n_embd_v_gqa     = 768
llm_load_print_meta: f_norm_eps       = 1.0e-12
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 3072
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = 1
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 137M
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 136.73 M
llm_load_print_meta: model size       = 260.86 MiB (16.00 BPW) 
llm_load_print_meta: general.name     = nomic-embed-text-v1.5
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_print_meta: CLS token        = 101 '[CLS]'
llm_load_print_meta: MASK token       = 103 '[MASK]'
llm_load_print_meta: LF token         = 0 '[PAD]'
llm_load_print_meta: max token length = 21
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.10 MiB
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 13/13 layers to GPU
llm_load_tensors:        CPU buffer size =    44.72 MiB
llm_load_tensors:      CUDA0 buffer size =   216.15 MiB
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1152.00 MiB
llama_new_context_with_model: KV self size  = 1152.00 MiB, K (f16):  576.00 MiB, V (f16):  576.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =    22.01 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     2.51 MiB
llama_new_context_with_model: graph nodes  = 453
llama_new_context_with_model: graph splits = 2
[1724364332] warming up the model with an empty run
/usr/include/c++/14.1.1/bits/stl_vector.h:1130: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = long unsigned int; _Alloc = std::allocator<long unsigned int>; reference = long unsigned int&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.
time=2024-08-22T17:05:32.459-05:00 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server not responding"
time=2024-08-22T17:05:33.160-05:00 level=ERROR source=sched.go:451 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped)"
time=2024-08-22T17:05:33.160-05:00 level=DEBUG source=sched.go:454 msg="triggering expiration for failed load" model=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:33.160-05:00 level=WARN source=server.go:503 msg="llama runner process no longer running" sys=134 string="signal: aborted (core dumped)"
time=2024-08-22T17:05:33.160-05:00 level=DEBUG source=server.go:572 msg="server unhealthy" error="llama runner process no longer running: -1 "
time=2024-08-22T17:05:33.160-05:00 level=DEBUG source=sched.go:355 msg="runner expired event received" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
[GIN] 2024/08/22 - 17:05:33 | 500 |   1.32818839s |       127.0.0.1 | POST     "/api/embeddings"
time=2024-08-22T17:05:33.160-05:00 level=DEBUG source=sched.go:278 msg="resetting model to expire immediately to make room" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 refCount=0
time=2024-08-22T17:05:33.160-05:00 level=DEBUG source=sched.go:291 msg="waiting for pending requests to complete and unload to occur" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:33.160-05:00 level=DEBUG source=sched.go:371 msg="got lock to unload" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:33.161-05:00 level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="46.4 GiB" before.free="22.9 GiB" before.free_swap="94.0 GiB" now.total="46.4 GiB" now.free="22.9 GiB" now.free_swap="94.0 GiB"
CUDA driver version: 12.4
time=2024-08-22T17:05:33.291-05:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.5 GiB" before.free="10.8 GiB" now.total="11.5 GiB" now.free="10.8 GiB" now.used="752.8 MiB"
releasing cuda driver library
time=2024-08-22T17:05:33.291-05:00 level=DEBUG source=server.go:1042 msg="stopping llama server"
time=2024-08-22T17:05:33.291-05:00 level=DEBUG source=sched.go:376 msg="runner released" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:33.541-05:00 level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="46.4 GiB" before.free="22.9 GiB" before.free_swap="94.0 GiB" now.total="46.4 GiB" now.free="22.8 GiB" now.free_swap="94.0 GiB"
CUDA driver version: 12.4
time=2024-08-22T17:05:33.663-05:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.5 GiB" before.free="10.8 GiB" now.total="11.5 GiB" now.free="10.8 GiB" now.used="752.8 MiB"
releasing cuda driver library
time=2024-08-22T17:05:33.792-05:00 level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="46.4 GiB" before.free="22.8 GiB" before.free_swap="94.0 GiB" now.total="46.4 GiB" now.free="22.8 GiB" now.free_swap="94.0 GiB"
CUDA driver version: 12.4
time=2024-08-22T17:05:33.927-05:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.5 GiB" before.free="10.8 GiB" now.total="11.5 GiB" now.free="10.8 GiB" now.used="752.8 MiB"
releasing cuda driver library
time=2024-08-22T17:05:34.041-05:00 level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="46.4 GiB" before.free="22.8 GiB" before.free_swap="94.0 GiB" now.total="46.4 GiB" now.free="22.8 GiB" now.free_swap="94.0 GiB"
CUDA driver version: 12.4
time=2024-08-22T17:05:34.172-05:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.5 GiB" before.free="10.8 GiB" now.total="11.5 GiB" now.free="10.8 GiB" now.used="752.8 MiB"
releasing cuda driver library
[... 16 further identical memory-polling cycles ("updating system memory data" / "CUDA driver version: 12.4" / "updating cuda memory data" / "releasing cuda driver library", 17:05:34.291–17:05:38.162) omitted ...]
time=2024-08-22T17:05:38.292-05:00 level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=5.131142177 model=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:38.292-05:00 level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="46.4 GiB" before.free="22.8 GiB" before.free_swap="94.0 GiB" now.total="46.4 GiB" now.free="22.8 GiB" now.free_swap="94.0 GiB"
CUDA driver version: 12.4
time=2024-08-22T17:05:38.292-05:00 level=DEBUG source=sched.go:380 msg="sending an unloaded event" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:38.292-05:00 level=DEBUG source=sched.go:355 msg="runner expired event received" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:38.292-05:00 level=DEBUG source=sched.go:371 msg="got lock to unload" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:38.292-05:00 level=DEBUG source=sched.go:376 msg="runner released" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:38.292-05:00 level=DEBUG source=sched.go:380 msg="sending an unloaded event" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:38.292-05:00 level=DEBUG source=sched.go:297 msg="unload completed" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:38.417-05:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.5 GiB" before.free="10.8 GiB" now.total="11.5 GiB" now.free="10.8 GiB" now.used="752.8 MiB"
releasing cuda driver library
time=2024-08-22T17:05:38.417-05:00 level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="46.4 GiB" before.free="22.8 GiB" before.free_swap="94.0 GiB" now.total="46.4 GiB" now.free="22.8 GiB" now.free_swap="94.0 GiB"
CUDA driver version: 12.4
time=2024-08-22T17:05:38.541-05:00 level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=5.380583545 model=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:38.541-05:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.5 GiB" before.free="10.8 GiB" now.total="11.5 GiB" now.free="10.8 GiB" now.used="752.8 MiB"
releasing cuda driver library
time=2024-08-22T17:05:38.542-05:00 level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="46.4 GiB" before.free="22.8 GiB" before.free_swap="94.0 GiB" now.total="46.4 GiB" now.free="22.8 GiB" now.free_swap="94.0 GiB"
CUDA driver version: 12.4
time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=sched.go:219 msg="loading first model" model=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[10.8 GiB]"
time=2024-08-22T17:05:38.544-05:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 gpu=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx parallel=4 available=11571036160 required="1.0 GiB"
time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=server.go:100 msg="system memory" total="46.4 GiB" free="22.8 GiB" free_swap="94.0 GiB"
time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[10.8 GiB]"
time=2024-08-22T17:05:38.544-05:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=13 layers.offload=13 layers.split="" memory.available="[10.8 GiB]" memory.required.full="1.0 GiB" memory.required.partial="1.0 GiB" memory.required.kv="96.0 MiB" memory.required.allocations="[1.0 GiB]" memory.weights.total="312.1 MiB" memory.weights.repeating="267.4 MiB" memory.weights.nonrepeating="44.7 MiB" memory.graph.full="192.0 MiB" memory.graph.partial="192.0 MiB"
time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu/ollama_llama_server
time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu_avx/ollama_llama_server
time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu_avx2/ollama_llama_server
time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cuda_v12/ollama_llama_server
time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu/ollama_llama_server
time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu_avx/ollama_llama_server
time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu_avx2/ollama_llama_server
time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cuda_v12/ollama_llama_server
time=2024-08-22T17:05:38.545-05:00 level=INFO source=server.go:384 msg="starting llama server" cmd="/tmp/ollama1214628755/runners/cuda_v12/ollama_llama_server --model /home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 --ctx-size 32768 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --verbose --parallel 4 --port 46485"
time=2024-08-22T17:05:38.545-05:00 level=DEBUG source=server.go:401 msg=subprocess environment="[CUDA_PATH=/opt/cuda PATH=/home/noe/.bun/bin:/usr/lib/ccache/bin:/opt/miniconda3/condabin:/home/noe/.pyenv/plugins/pyenv-virtualenv/shims:/home/noe/.pyenv/shims:/home/noe/.nvm/versions/node/v20.11.1/bin:/home/noe/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/cuda/bin:/opt/cuda/nsight_compute:/opt/cuda/nsight_systems/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/usr/lib/rustup/bin:/home/noe/.cargo/bin LD_LIBRARY_PATH=/tmp/ollama1214628755/runners/cuda_v12:/tmp/ollama1214628755/runners CUDA_VISIBLE_DEVICES=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx]"
time=2024-08-22T17:05:38.545-05:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-22T17:05:38.545-05:00 level=DEBUG source=sched.go:571 msg="evaluating already loaded" model=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:38.545-05:00 level=INFO source=server.go:584 msg="waiting for llama runner to start responding"
time=2024-08-22T17:05:38.546-05:00 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3485 commit="6eeaeba12" tid="139868954624000" timestamp=1724364338
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139868954624000" timestamp=1724364338 total_threads=20
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="46485" tid="139868954624000" timestamp=1724364338
llama_model_loader: loaded meta data with 24 key-value pairs and 112 tensors from /home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
llama_model_loader: - kv   1:                               general.name str              = nomic-embed-text-v1.5
llama_model_loader: - kv   2:                     nomic-bert.block_count u32              = 12
llama_model_loader: - kv   3:                  nomic-bert.context_length u32              = 2048
llama_model_loader: - kv   4:                nomic-bert.embedding_length u32              = 768
llama_model_loader: - kv   5:             nomic-bert.feed_forward_length u32              = 3072
llama_model_loader: - kv   6:            nomic-bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:    nomic-bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                nomic-bert.attention.causal bool             = false
llama_model_loader: - kv  10:                    nomic-bert.pooling_type u32              = 1
llama_model_loader: - kv  11:                  nomic-bert.rope.freq_base f32              = 1000.000000
llama_model_loader: - kv  12:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  20:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  22:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  23:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - type  f32:   51 tensors
llama_model_loader: - type  f16:   61 tensors
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.2032 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = nomic-bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 30522
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 768
llm_load_print_meta: n_layer          = 12
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 768
llm_load_print_meta: n_embd_v_gqa     = 768
llm_load_print_meta: f_norm_eps       = 1.0e-12
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 3072
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = 1
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 137M
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 136.73 M
llm_load_print_meta: model size       = 260.86 MiB (16.00 BPW) 
llm_load_print_meta: general.name     = nomic-embed-text-v1.5
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_print_meta: CLS token        = 101 '[CLS]'
llm_load_print_meta: MASK token       = 103 '[MASK]'
llm_load_print_meta: LF token         = 0 '[PAD]'
llm_load_print_meta: max token length = 21
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
time=2024-08-22T17:05:38.724-05:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.5 GiB" before.free="10.8 GiB" now.total="11.5 GiB" now.free="10.8 GiB" now.used="767.2 MiB"
releasing cuda driver library
time=2024-08-22T17:05:38.724-05:00 level=DEBUG source=sched.go:655 msg="gpu VRAM free memory converged after 5.56 seconds" model=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
llm_load_tensors: ggml ctx size =    0.10 MiB
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 13/13 layers to GPU
llm_load_tensors:        CPU buffer size =    44.72 MiB
llm_load_tensors:      CUDA0 buffer size =   216.15 MiB
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1152.00 MiB
llama_new_context_with_model: KV self size  = 1152.00 MiB, K (f16):  576.00 MiB, V (f16):  576.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =    22.01 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     2.51 MiB
llama_new_context_with_model: graph nodes  = 453
llama_new_context_with_model: graph splits = 2
[1724364338] warming up the model with an empty run
/usr/include/c++/14.1.1/bits/stl_vector.h:1130: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = long unsigned int; _Alloc = std::allocator<long unsigned int>; reference = long unsigned int&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.
time=2024-08-22T17:05:38.997-05:00 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server not responding"
time=2024-08-22T17:05:39.720-05:00 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server error"
time=2024-08-22T17:05:39.970-05:00 level=ERROR source=sched.go:451 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped)"
time=2024-08-22T17:05:39.970-05:00 level=DEBUG source=sched.go:454 msg="triggering expiration for failed load" model=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:39.970-05:00 level=WARN source=server.go:503 msg="llama runner process no longer running" sys=134 string="signal: aborted (core dumped)"
time=2024-08-22T17:05:39.970-05:00 level=DEBUG source=server.go:572 msg="server unhealthy" error="llama runner process no longer running: -1 "
time=2024-08-22T17:05:39.970-05:00 level=DEBUG source=sched.go:278 msg="resetting model to expire immediately to make room" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 refCount=0
time=2024-08-22T17:05:39.970-05:00 level=DEBUG source=sched.go:291 msg="waiting for pending requests to complete and unload to occur" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:39.970-05:00 level=DEBUG source=sched.go:297 msg="unload completed" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:39.970-05:00 level=DEBUG source=sched.go:571 msg="evaluating already loaded" model=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:39.970-05:00 level=WARN source=server.go:503 msg="llama runner process no longer running" sys=134 string="signal: aborted (core dumped)"
time=2024-08-22T17:05:39.970-05:00 level=DEBUG source=server.go:572 msg="server unhealthy" error="llama runner process no longer running: -1 "
[GIN] 2024/08/22 - 17:05:39 | 500 |  8.136669758s |       127.0.0.1 | POST     "/api/embeddings"
time=2024-08-22T17:05:39.970-05:00 level=DEBUG source=sched.go:278 msg="resetting model to expire immediately to make room" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 refCount=0
time=2024-08-22T17:05:39.970-05:00 level=DEBUG source=sched.go:291 msg="waiting for pending requests to complete and unload to occur" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:39.970-05:00 level=DEBUG source=sched.go:355 msg="runner expired event received" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:39.971-05:00 level=DEBUG source=sched.go:371 msg="got lock to unload" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-22T17:05:39.971-05:00 level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="46.4 GiB" before.free="22.8 GiB" before.free_swap="94.0 GiB" now.total="46.4 GiB" now.free="22.8 GiB" now.free_swap="94.0 GiB"
CUDA driver version: 12.4
time=2024-08-22T17:05:40.093-05:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.5 GiB" before.free="10.8 GiB" now.total="11.5 GiB" now.free="10.8 GiB" now.used="752.8 MiB"
releasing cuda driver library
time=2024-08-22T17:05:40.093-05:00 level=DEBUG source=server.go:1042 msg="stopping llama server"
time=2024-08-22T17:05:40.093-05:00 level=DEBUG source=sched.go:376 msg="runner released" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
[... further identical memory-polling cycles (17:05:40.344–17:05:41.093) omitted ...]
^C
time=2024-08-22T17:05:41.181-05:00 level=DEBUG source=assets.go:112 msg="cleaning up" dir=/tmp/ollama1214628755
time=2024-08-22T17:05:41.181-05:00 level=DEBUG source=sched.go:294 msg="shutting down scheduler pending loop"
time=2024-08-22T17:05:41.221-05:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.5 GiB" before.free="10.8 GiB" now.total="11.5 GiB" now.free="10.8 GiB" now.used="752.8 MiB"
releasing cuda driver library

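Two things stand out in the log above. First, the scheduler's KV estimate (`memory.required.kv="96.0 MiB"`) is far smaller than what the runner actually allocates: with `--ctx-size 32768`, the KV cache for this model works out to

32768 (n_ctx) × 12 (n_layer) × (768 + 768) (K + V width per token) × 2 bytes (f16) = 1,207,959,552 bytes = 1152 MiB,

which is exactly the `llama_kv_cache_init: CUDA0 KV buffer size = 1152.00 MiB` line, and is why a ~260 MiB f16 embedding model ends up with `required="1.0 GiB"` of VRAM.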
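Second, the actual failure is the libstdc++ assertion that fires right after `warming up the model with an empty run`: an out-of-range `std::vector::operator[]` somewhere in the warmup path. Judging from the header path in the message (`/usr/include/c++/14.1.1/...`), this runner was built locally against GCC 14.1.1 with `_GLIBCXX_ASSERTIONS` enabled (Arch-based distros such as Manjaro include it in their default hardening flags), which turns the out-of-bounds read into an immediate abort instead of silent memory corruption. That abort is what the server reports as `signal: aborted (core dumped)`, and `sys=134` is the matching wait status (core-dump bit 128 + SIGABRT 6). A minimal sketch of the mechanism, illustration only and assuming nothing about the Ollama code itself:

```cpp
// repro.cpp — not Ollama/llama.cpp code: shows how _GLIBCXX_ASSERTIONS
// turns an unchecked out-of-range vector index into the exact failure
// mode seen in the log.
// Build and run: g++ -D_GLIBCXX_ASSERTIONS repro.cpp -o repro && ./repro
#include <vector>

int main() {
    // Empty vector; the runner crashed during an *empty* warmup run, so an
    // unchecked index into an empty container is the assumed analogue here.
    std::vector<unsigned long> v;
    // With _GLIBCXX_ASSERTIONS this bounds check fails and calls abort():
    //   Assertion '__n < this->size()' failed.   (SIGABRT, wait status 134)
    return static_cast<int>(v[0]);
}
```

Without the hardening flag the same bad index is plain undefined behaviour, typically a segfault, which is consistent with the `segmentation fault (core dumped)` reported at the top of this issue.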
time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=sched.go:219 msg="loading first model" model=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[10.8 GiB]" time=2024-08-22T17:05:38.544-05:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 gpu=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx parallel=4 available=11571036160 required="1.0 GiB" time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=server.go:100 msg="system memory" total="46.4 GiB" free="22.8 GiB" free_swap="94.0 GiB" time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[10.8 GiB]" time=2024-08-22T17:05:38.544-05:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=13 layers.offload=13 layers.split="" memory.available="[10.8 GiB]" memory.required.full="1.0 GiB" memory.required.partial="1.0 GiB" memory.required.kv="96.0 MiB" memory.required.allocations="[1.0 GiB]" memory.weights.total="312.1 MiB" memory.weights.repeating="267.4 MiB" memory.weights.nonrepeating="44.7 MiB" memory.graph.full="192.0 MiB" memory.graph.partial="192.0 MiB" time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu/ollama_llama_server time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu_avx/ollama_llama_server time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu_avx2/ollama_llama_server time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cuda_v12/ollama_llama_server time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu/ollama_llama_server time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu_avx/ollama_llama_server time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cpu_avx2/ollama_llama_server time=2024-08-22T17:05:38.544-05:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama1214628755/runners/cuda_v12/ollama_llama_server time=2024-08-22T17:05:38.545-05:00 level=INFO source=server.go:384 msg="starting llama server" cmd="/tmp/ollama1214628755/runners/cuda_v12/ollama_llama_server --model /home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 --ctx-size 32768 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --verbose --parallel 4 --port 46485" time=2024-08-22T17:05:38.545-05:00 level=DEBUG source=server.go:401 msg=subprocess environment="[CUDA_PATH=/opt/cuda 
PATH=/home/noe/.bun/bin:/usr/lib/ccache/bin:/opt/miniconda3/condabin:/home/noe/.pyenv/plugins/pyenv-virtualenv/shims:/home/noe/.pyenv/shims:/home/noe/.nvm/versions/node/v20.11.1/bin:/home/noe/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/cuda/bin:/opt/cuda/nsight_compute:/opt/cuda/nsight_systems/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/usr/lib/rustup/bin:/home/noe/.cargo/bin LD_LIBRARY_PATH=/tmp/ollama1214628755/runners/cuda_v12:/tmp/ollama1214628755/runners CUDA_VISIBLE_DEVICES=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx]" time=2024-08-22T17:05:38.545-05:00 level=INFO source=sched.go:445 msg="loaded runners" count=1 time=2024-08-22T17:05:38.545-05:00 level=DEBUG source=sched.go:571 msg="evaluating already loaded" model=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 time=2024-08-22T17:05:38.545-05:00 level=INFO source=server.go:584 msg="waiting for llama runner to start responding" time=2024-08-22T17:05:38.546-05:00 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server error" INFO [main] build info | build=3485 commit="6eeaeba12" tid="139868954624000" timestamp=1724364338 INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139868954624000" timestamp=1724364338 total_threads=20 INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="46485" tid="139868954624000" timestamp=1724364338 llama_model_loader: loaded meta data with 24 key-value pairs and 112 tensors from /home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = nomic-bert llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5 llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12 llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048 llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768 llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072 llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12 llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000 llama_model_loader: - kv 8: general.file_type u32 = 1 llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1 llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000 llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2 llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101 llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102 llama_model_loader: - kv 15: tokenizer.ggml.model str = bert llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "... llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00... llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100 llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102 llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 22: tokenizer.ggml.cls_token_id u32 = 101 llama_model_loader: - kv 23: tokenizer.ggml.mask_token_id u32 = 103 llama_model_loader: - type f32: 51 tensors llama_model_loader: - type f16: 61 tensors llm_load_vocab: special tokens cache size = 5 llm_load_vocab: token to piece cache size = 0.2032 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = nomic-bert llm_load_print_meta: vocab type = WPM llm_load_print_meta: n_vocab = 30522 llm_load_print_meta: n_merges = 0 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 2048 llm_load_print_meta: n_embd = 768 llm_load_print_meta: n_layer = 12 llm_load_print_meta: n_head = 12 llm_load_print_meta: n_head_kv = 12 llm_load_print_meta: n_rot = 64 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 64 llm_load_print_meta: n_embd_head_v = 64 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 768 llm_load_print_meta: n_embd_v_gqa = 768 llm_load_print_meta: f_norm_eps = 1.0e-12 llm_load_print_meta: f_norm_rms_eps = 0.0e+00 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 3072 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 0 llm_load_print_meta: pooling type = 1 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 2048 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 137M llm_load_print_meta: model ftype = F16 llm_load_print_meta: model params = 136.73 M llm_load_print_meta: model size = 260.86 MiB (16.00 BPW) llm_load_print_meta: general.name = nomic-embed-text-v1.5 llm_load_print_meta: BOS token = 101 '[CLS]' llm_load_print_meta: EOS token = 102 '[SEP]' llm_load_print_meta: UNK token = 100 '[UNK]' llm_load_print_meta: SEP token = 102 '[SEP]' llm_load_print_meta: PAD token = 0 '[PAD]' llm_load_print_meta: CLS token = 101 '[CLS]' llm_load_print_meta: MASK token = 103 '[MASK]' llm_load_print_meta: LF token = 0 '[PAD]' llm_load_print_meta: max token length = 21 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes time=2024-08-22T17:05:38.724-05:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.5 GiB" before.free="10.8 GiB" now.total="11.5 GiB" now.free="10.8 GiB" now.used="767.2 MiB" releasing cuda driver library time=2024-08-22T17:05:38.724-05:00 level=DEBUG source=sched.go:655 msg="gpu VRAM free memory converged after 5.56 seconds" model=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 llm_load_tensors: ggml ctx size = 0.10 MiB llm_load_tensors: offloading 12 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU 
llm_load_tensors: offloaded 13/13 layers to GPU llm_load_tensors: CPU buffer size = 44.72 MiB llm_load_tensors: CUDA0 buffer size = 216.15 MiB llama_new_context_with_model: n_ctx = 32768 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 1000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 1152.00 MiB llama_new_context_with_model: KV self size = 1152.00 MiB, K (f16): 576.00 MiB, V (f16): 576.00 MiB llama_new_context_with_model: CPU output buffer size = 0.00 MiB llama_new_context_with_model: CUDA0 compute buffer size = 22.01 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 2.51 MiB llama_new_context_with_model: graph nodes = 453 llama_new_context_with_model: graph splits = 2 [1724364338] warming up the model with an empty run /usr/include/c++/14.1.1/bits/stl_vector.h:1130: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = long unsigned int; _Alloc = std::allocator<long unsigned int>; reference = long unsigned int&; size_type = long unsigned int]: Assertion '__n < this->size()' failed. time=2024-08-22T17:05:38.997-05:00 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server not responding" time=2024-08-22T17:05:39.720-05:00 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server error" time=2024-08-22T17:05:39.970-05:00 level=ERROR source=sched.go:451 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped)" time=2024-08-22T17:05:39.970-05:00 level=DEBUG source=sched.go:454 msg="triggering expiration for failed load" model=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 time=2024-08-22T17:05:39.970-05:00 level=WARN source=server.go:503 msg="llama runner process no longer running" sys=134 string="signal: aborted (core dumped)" time=2024-08-22T17:05:39.970-05:00 level=DEBUG source=server.go:572 msg="server unhealthy" error="llama runner process no longer running: -1 " time=2024-08-22T17:05:39.970-05:00 level=DEBUG source=sched.go:278 msg="resetting model to expire immediately to make room" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 refCount=0 time=2024-08-22T17:05:39.970-05:00 level=DEBUG source=sched.go:291 msg="waiting for pending requests to complete and unload to occur" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 time=2024-08-22T17:05:39.970-05:00 level=DEBUG source=sched.go:297 msg="unload completed" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 time=2024-08-22T17:05:39.970-05:00 level=DEBUG source=sched.go:571 msg="evaluating already loaded" model=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 time=2024-08-22T17:05:39.970-05:00 level=WARN source=server.go:503 msg="llama runner process no longer running" sys=134 string="signal: aborted (core dumped)" time=2024-08-22T17:05:39.970-05:00 level=DEBUG source=server.go:572 msg="server unhealthy" error="llama runner process no longer running: -1 " [GIN] 2024/08/22 - 17:05:39 | 500 | 8.136669758s | 127.0.0.1 | POST "/api/embeddings" time=2024-08-22T17:05:39.970-05:00 level=DEBUG 
source=sched.go:278 msg="resetting model to expire immediately to make room" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 refCount=0 time=2024-08-22T17:05:39.970-05:00 level=DEBUG source=sched.go:291 msg="waiting for pending requests to complete and unload to occur" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 time=2024-08-22T17:05:39.970-05:00 level=DEBUG source=sched.go:355 msg="runner expired event received" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 time=2024-08-22T17:05:39.971-05:00 level=DEBUG source=sched.go:371 msg="got lock to unload" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 time=2024-08-22T17:05:39.971-05:00 level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="46.4 GiB" before.free="22.8 GiB" before.free_swap="94.0 GiB" now.total="46.4 GiB" now.free="22.8 GiB" now.free_swap="94.0 GiB" CUDA driver version: 12.4 time=2024-08-22T17:05:40.093-05:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.5 GiB" before.free="10.8 GiB" now.total="11.5 GiB" now.free="10.8 GiB" now.used="752.8 MiB" releasing cuda driver library time=2024-08-22T17:05:40.093-05:00 level=DEBUG source=server.go:1042 msg="stopping llama server" time=2024-08-22T17:05:40.093-05:00 level=DEBUG source=sched.go:376 msg="runner released" modelPath=/home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 time=2024-08-22T17:05:40.344-05:00 level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="46.4 GiB" before.free="22.8 GiB" before.free_swap="94.0 GiB" now.total="46.4 GiB" now.free="22.8 GiB" now.free_swap="94.0 GiB" CUDA driver version: 12.4 time=2024-08-22T17:05:40.472-05:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.5 GiB" before.free="10.8 GiB" now.total="11.5 GiB" now.free="10.8 GiB" now.used="752.8 MiB" releasing cuda driver library time=2024-08-22T17:05:40.594-05:00 level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="46.4 GiB" before.free="22.8 GiB" before.free_swap="94.0 GiB" now.total="46.4 GiB" now.free="22.9 GiB" now.free_swap="94.0 GiB" CUDA driver version: 12.4 time=2024-08-22T17:05:40.722-05:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.5 GiB" before.free="10.8 GiB" now.total="11.5 GiB" now.free="10.8 GiB" now.used="752.8 MiB" releasing cuda driver library time=2024-08-22T17:05:40.844-05:00 level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="46.4 GiB" before.free="22.9 GiB" before.free_swap="94.0 GiB" now.total="46.4 GiB" now.free="22.8 GiB" now.free_swap="94.0 GiB" CUDA driver version: 12.4 time=2024-08-22T17:05:40.967-05:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.5 GiB" before.free="10.8 GiB" now.total="11.5 GiB" now.free="10.8 GiB" now.used="752.8 MiB" releasing cuda driver library time=2024-08-22T17:05:41.093-05:00 
level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="46.4 GiB" before.free="22.8 GiB" before.free_swap="94.0 GiB" now.total="46.4 GiB" now.free="22.8 GiB" now.free_swap="94.0 GiB" CUDA driver version: 12.4 ^Ctime=2024-08-22T17:05:41.181-05:00 level=DEBUG source=assets.go:112 msg="cleaning up" dir=/tmp/ollama1214628755 time=2024-08-22T17:05:41.181-05:00 level=DEBUG source=sched.go:294 msg="shutting down scheduler pending loop" time=2024-08-22T17:05:41.221-05:00 level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-3fcffc7b-1af0-cc5c-xxxxxxxxxxx name="NVIDIA GeForce RTX 4070" overhead="0 B" before.total="11.5 GiB" before.free="10.8 GiB" now.total="11.5 GiB" now.free="10.8 GiB" now.used="752.8 MiB" releasing cuda driver library ```
Author
Owner

@rick-github commented on GitHub (Aug 22, 2024):

It's a different problem:

```
/usr/include/c++/14.1.1/bits/stl_vector.h:1130: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = long unsigned int; _Alloc = std::allocator<long unsigned int>; reference = long unsigned int&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.
```

Where did you get this version of ollama from? It's built with CUDA 12 while the official v0.3.3 uses CUDA 11.

You can try running the following command to see if there's more info on what the runner was trying to do when it crashed:

```
gdb --batch -ex="set confirm off" -ex="set pagination off" -ex=r -ex=bt -ex=disassemble -ex="i r" -ex=q --args \
  $(ls /tmp/ollama*/runners/cuda_v12/ollama_llama_server | tail -1) \
 --model /home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 \
 --ctx-size 32768 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --verbose --parallel 4 --port 46199
```
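
If the temporary runner directory has already been cleaned up by the time you get to it, an alternative, assuming systemd-coredump is collecting core dumps (the default on Arch-based systems), is to pull a backtrace out of the core the abort left behind:

```
# sketch, assuming systemd-coredump is enabled (default on Arch/Manjaro)
coredumpctl list ollama_llama_server   # locate the most recent crash
coredumpctl gdb ollama_llama_server    # open the newest core in gdb, then run: bt
```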
<!-- gh-comment-id:2305867319 --> @rick-github commented on GitHub (Aug 22, 2024): It's a different problem: ``` /usr/include/c++/14.1.1/bits/stl_vector.h:1130: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = long unsigned int; _Alloc = std::allocator<long unsigned int>; reference = long unsigned int&; size_type = long unsigned int]: Assertion '__n < this->size()' failed. ``` Where did you get this version of ollama from? It's built with CUDA 12 while the official v0.3.3 uses CUDA 11. You can try running the following command to see if there's more info on what the runner was trying to do when it crashed: ``` gdb --batch -ex="set confirm off" -ex="set pagination off" -ex=r -ex=bt -ex=disassemble -ex="i r" -ex=q --args \ $(ls /tmp/ollama*/runners/cuda_v12/ollama_llama_server | tail -1) \ --model /home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 \ --ctx-size 32768 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --verbose --parallel 4 --port 46199 ```
Author
Owner

@Milor123 commented on GitHub (Aug 22, 2024):

> It's a different problem:
>
> ```
> /usr/include/c++/14.1.1/bits/stl_vector.h:1130: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = long unsigned int; _Alloc = std::allocator<long unsigned int>; reference = long unsigned int&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.
> ```
>
> Where did you get this version of ollama from? It's built with CUDA 12 while the official v0.3.3 uses CUDA 11.
>
> You can try running the following command to see if there's more info on what the runner was trying to do when it crashed:
>
> ```
> gdb --batch -ex="set confirm off" -ex="set pagination off" -ex=r -ex=bt -ex=disassemble -ex="i r" -ex=q --args \
>   $(ls /tmp/ollama*/runners/cuda_v12/ollama_llama_server | tail -1) \
>  --model /home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 \
>  --ctx-size 32768 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --verbose --parallel 4 --port 46199
> ```

Thank you very much, let me try it. I am updating Ollama and my whole PC. I am on Manjaro KDE Plasma with Wayland, updating ollama-cuda from 0.3.1 to 0.3.5-1.

<!-- gh-comment-id:2305873227 --> @Milor123 commented on GitHub (Aug 22, 2024): > It's a different problem: > > ``` > /usr/include/c++/14.1.1/bits/stl_vector.h:1130: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = long unsigned int; _Alloc = std::allocator<long unsigned int>; reference = long unsigned int&; size_type = long unsigned int]: Assertion '__n < this->size()' failed. > ``` > > Where did you get this version of ollama from? It's built with CUDA 12 while the official v0.3.3 uses CUDA 11. > > You can try running the following command to see if there's more info on what the runner was trying to do when it crashed: > > ``` > gdb --batch -ex="set confirm off" -ex="set pagination off" -ex=r -ex=bt -ex=disassemble -ex="i r" -ex=q --args \ > $(ls /tmp/ollama*/runners/cuda_v12/ollama_llama_server | tail -1) \ > --model /home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 \ > --ctx-size 32768 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --verbose --parallel 4 --port 46199 > ``` Thank u very much, let me try it, i am updating ollama and all my pc, I am in manjaro KDE playma with wayland, i am updating from ollama-cuda 0.3.1 to 0.3.5-1
Author
Owner

@Milor123 commented on GitHub (Aug 22, 2024):

@rick-github

After updating my whole system, I get this:

```
gdb --batch -ex="set confirm off" -ex="set pagination off" -ex=r -ex=bt -ex=disassemble -ex="i r" -ex=q --args \
  $(ls /tmp/ollama*/runners/cuda_v12/ollama_llama_server | tail -1) \
 --model /home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 \
 --ctx-size 32768 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --verbose --parallel 4 --port 46199

zsh: no matches found: /tmp/ollama*/runners/cuda_v12/ollama_llama_server
--model: No existe el fichero o el directorio.
No executable file specified.
Use the "file" or "exec-file" command.
No stack.
No frame selected.
The program has no registers now.
```

If the problem could be CUDA 12, should I try running Ollama via Docker?

<!-- gh-comment-id:2305895292 --> @Milor123 commented on GitHub (Aug 22, 2024): @rick-github After update all my system i get this. ``` gdb --batch -ex="set confirm off" -ex="set pagination off" -ex=r -ex=bt -ex=disassemble -ex="i r" -ex=q --args \ $(ls /tmp/ollama*/runners/cuda_v12/ollama_llama_server | tail -1) \ --model /home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 \ --ctx-size 32768 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --verbose --parallel 4 --port 46199 zsh: no matches found: /tmp/ollama*/runners/cuda_v12/ollama_llama_server --model: No existe el fichero o el directorio. No executable file specified. Use the "file" or "exec-file" command. No stack. No frame selected. The program has no registers now. ``` if the problem could be CUDA12, then should i try use ollama over docker ?
Author
Owner

@rick-github commented on GitHub (Aug 22, 2024):

This looks like ollama is not running. What's the output of `service ollama status`?
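
If it was installed as a systemd service, a sketch of the equivalent checks (assuming the stock `ollama` unit name):

```
systemctl status ollama   # confirm the service is active
journalctl -u ollama -f   # follow the server log while reproducing the crash
```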

<!-- gh-comment-id:2305911258 --> @rick-github commented on GitHub (Aug 22, 2024): This looks like ollama is not running. What's the output of `service ollama status`?
Author
Owner

@Milor123 commented on GitHub (Aug 22, 2024):

> This looks like ollama is not running. What's the output of `service ollama status`?

Ohh excuse me, how silly of me, hahaha, sorry.

Here is the new output with `ollama serve` running:

```
gdb --batch -ex="set confirm off" -ex="set pagination off" -ex=r -ex=bt -ex=disassemble -ex="i r" -ex=q --args \
  $(ls /tmp/ollama*/runners/cuda_v12/ollama_llama_server | tail -1) \
 --model /home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 \
 --ctx-size 32768 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --verbose --parallel 4 --port 46199

zsh: no matches found: /tmp/ollama*/runners/cuda_v12/ollama_llama_server
--model: No existe el fichero o el directorio.
No executable file specified.
Use the "file" or "exec-file" command.
No stack.
No frame selected.
The program has no registers now.

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.archlinux.org>
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
INFO [main] build info | build=3535 commit="1e6f6554a" tid="140737353269248" timestamp=1724369443
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140737353269248" timestamp=1724369443 total_threads=20
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="46199" tid="140737353269248" timestamp=1724369443
[New Thread 0x7fffd0fff000 (LWP 22496)]
[New Thread 0x7fffcbfff000 (LWP 22497)]
[New Thread 0x7fffcb7fe000 (LWP 22498)]
[New Thread 0x7fffcaffd000 (LWP 22499)]
[New Thread 0x7fffca7fc000 (LWP 22500)]
[New Thread 0x7fffc9ffb000 (LWP 22501)]
[New Thread 0x7fffc97fa000 (LWP 22502)]
[New Thread 0x7fffc8ff9000 (LWP 22503)]
[New Thread 0x7fffc87f8000 (LWP 22504)]
[New Thread 0x7fffc7ff7000 (LWP 22505)]
[New Thread 0x7fffc77f6000 (LWP 22506)]
[New Thread 0x7fffc6ff5000 (LWP 22507)]
[New Thread 0x7fffc67f4000 (LWP 22508)]
[New Thread 0x7fffc5ff3000 (LWP 22509)]
[New Thread 0x7fffc57f2000 (LWP 22510)]
[New Thread 0x7fffc4ff1000 (LWP 22511)]
[New Thread 0x7fffc47f0000 (LWP 22512)]
llama_model_loader: loaded meta data with 24 key-value pairs and 112 tensors from /home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
llama_model_loader: - kv   1:                               general.name str              = nomic-embed-text-v1.5
llama_model_loader: - kv   2:                     nomic-bert.block_count u32              = 12
llama_model_loader: - kv   3:                  nomic-bert.context_length u32              = 2048
llama_model_loader: - kv   4:                nomic-bert.embedding_length u32              = 768
llama_model_loader: - kv   5:             nomic-bert.feed_forward_length u32              = 3072
llama_model_loader: - kv   6:            nomic-bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:    nomic-bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                nomic-bert.attention.causal bool             = false
llama_model_loader: - kv  10:                    nomic-bert.pooling_type u32              = 1
llama_model_loader: - kv  11:                  nomic-bert.rope.freq_base f32              = 1000.000000
llama_model_loader: - kv  12:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = bert
[New Thread 0x7fffc3fef000 (LWP 22513)]
[New Thread 0x7fffc37ee000 (LWP 22514)]
[New Thread 0x7fffc2fed000 (LWP 22515)]
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  20:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  22:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  23:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - type  f32:   51 tensors
llama_model_loader: - type  f16:   61 tensors
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.2032 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = nomic-bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 30522
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 768
llm_load_print_meta: n_layer          = 12
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 768
llm_load_print_meta: n_embd_v_gqa     = 768
llm_load_print_meta: f_norm_eps       = 1.0e-12
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 3072
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = 1
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 137M
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 136.73 M
llm_load_print_meta: model size       = 260.86 MiB (16.00 BPW) 
llm_load_print_meta: general.name     = nomic-embed-text-v1.5
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_print_meta: CLS token        = 101 '[CLS]'
llm_load_print_meta: MASK token       = 103 '[MASK]'
llm_load_print_meta: LF token         = 0 '[PAD]'
llm_load_print_meta: max token length = 21
[New Thread 0x7fffc27ec000 (LWP 22516)]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
[New Thread 0x7fffc09de000 (LWP 22517)]
[New Thread 0x7fffb5fff000 (LWP 22518)]
llm_load_tensors: ggml ctx size =    0.10 MiB
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 13/13 layers to GPU
llm_load_tensors:        CPU buffer size =    44.72 MiB
llm_load_tensors:      CUDA0 buffer size =   216.15 MiB
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1152.00 MiB
llama_new_context_with_model: KV self size  = 1152.00 MiB, K (f16):  576.00 MiB, V (f16):  576.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =    22.01 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     2.51 MiB
llama_new_context_with_model: graph nodes  = 453
llama_new_context_with_model: graph splits = 2
[1724369443] warming up the model with an empty run
/usr/include/c++/14.2.1/bits/stl_vector.h:1130: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = long unsigned int; _Alloc = std::allocator<long unsigned int>; reference = long unsigned int&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.

Thread 1 "ollama_llama_se" received signal SIGABRT, Aborted.
0x00007fffef4a7a7c in ?? () from /usr/lib/libc.so.6
#0  0x00007fffef4a7a7c in ?? () from /usr/lib/libc.so.6
#1  0x00007fffef452508 in raise () from /usr/lib/libc.so.6
#2  0x00007fffef43a4bb in abort () from /usr/lib/libc.so.6
#3  0x00007fffef6d3bb0 in std::__glibcxx_assert_fail (file=file@entry=0x5555558fd5a0 "/usr/include/c++/14.2.1/bits/stl_vector.h", line=line@entry=1130, function=function@entry=0x555555909a90 "std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = long unsigned int; _Alloc = std::allocator<long unsigned int>; reference = long unsigned int&; size_type"..., condition=condition@entry=0x5555558d78a3 "__n < this->size()") at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/assert_fail.cc:41
#4  0x00005555556d6149 in llama_set_inputs (lctx=..., batch=...) at /usr/include/c++/14.2.1/bits/stl_vector.h:1128
#5  0x000055555574687a in llama_decode_internal(llama_context&, llama_batch) [clone .isra.0] (lctx=..., batch_all=...) at /usr/src/debug/ollama/ollama-cuda/llm/llama.cpp/src/llama.cpp:14639
#6  0x00005555557093cc in llama_decode (ctx=0x55556f5d1c80, batch=...) at /usr/src/debug/ollama/ollama-cuda/llm/llama.cpp/src/llama.cpp:18352
#7  llama_init_from_gpt_params (params=...) at /usr/src/debug/ollama/ollama-cuda/llm/llama.cpp/common/common.cpp:2160
#8  0x00005555555d523d in llama_server_context::load_model (this=0x7fffffffc740, params_=...) at /usr/src/debug/ollama/ollama-cuda/llm/ext_server/server.cpp:406
#9  0x0000555555596a20 in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/ollama/ollama-cuda/llm/ext_server/server.cpp:3031
No function contains program counter for selected frame.
rax            0x0                 0
rbx            0x57dd              22493
rcx            0x7fffef4a7a7c      140737208023676
rdx            0x6                 6
rsi            0x57dd              22493
rdi            0x57dd              22493
rbp            0x7ffff7f2c000      0x7ffff7f2c000
rsp            0x7fffffffad40      0x7fffffffad40
r8             0x0                 0
r9             0xffffffee          4294967278
r10            0x8                 8
r11            0x246               582
r12            0x10                16
r13            0x6                 6
r14            0x2                 2
r15            0x7fffffffb110      140737488335120
rip            0x7fffef4a7a7c      0x7fffef4a7a7c
eflags         0x246               [ PF ZF IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
fs_base        0x7ffff7f2c000      140737353269248
gs_base        0x0                 0
zsh: command not found: zsh:
zsh: command not found: --model:
zsh: command not found: No
zsh: command not found: Use
zsh: command not found: No
zsh: command not found: No
zsh: command not found: The
 ✘ noe@noe-systemproductname  ~  which ollama
/usr/bin/ollama
```

<!-- gh-comment-id:2305915453 --> @Milor123 commented on GitHub (Aug 22, 2024): > This looks like ollama is not running. What's the output of `service ollama status`? Ohh excuseme, that idiot hahaha sorry, the new output with ollama serve loaded ``` gdb --batch -ex="set confirm off" -ex="set pagination off" -ex=r -ex=bt -ex=disassemble -ex="i r" -ex=q --args \ $(ls /tmp/ollama*/runners/cuda_v12/ollama_llama_server | tail -1) \ --model /home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 \ --ctx-size 32768 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --verbose --parallel 4 --port 46199 zsh: no matches found: /tmp/ollama*/runners/cuda_v12/ollama_llama_server --model: No existe el fichero o el directorio. No executable file specified. Use the "file" or "exec-file" command. No stack. No frame selected. The program has no registers now. This GDB supports auto-downloading debuginfo from the following URLs: <https://debuginfod.archlinux.org> Debuginfod has been disabled. To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit. [Thread debugging using libthread_db enabled] Using host libthread_db library "/usr/lib/libthread_db.so.1". INFO [main] build info | build=3535 commit="1e6f6554a" tid="140737353269248" timestamp=1724369443 INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140737353269248" timestamp=1724369443 total_threads=20 INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="46199" tid="140737353269248" timestamp=1724369443 [New Thread 0x7fffd0fff000 (LWP 22496)] [New Thread 0x7fffcbfff000 (LWP 22497)] [New Thread 0x7fffcb7fe000 (LWP 22498)] [New Thread 0x7fffcaffd000 (LWP 22499)] [New Thread 0x7fffca7fc000 (LWP 22500)] [New Thread 0x7fffc9ffb000 (LWP 22501)] [New Thread 0x7fffc97fa000 (LWP 22502)] [New Thread 0x7fffc8ff9000 (LWP 22503)] [New Thread 0x7fffc87f8000 (LWP 22504)] [New Thread 0x7fffc7ff7000 (LWP 22505)] [New Thread 0x7fffc77f6000 (LWP 22506)] [New Thread 0x7fffc6ff5000 (LWP 22507)] [New Thread 0x7fffc67f4000 (LWP 22508)] [New Thread 0x7fffc5ff3000 (LWP 22509)] [New Thread 0x7fffc57f2000 (LWP 22510)] [New Thread 0x7fffc4ff1000 (LWP 22511)] [New Thread 0x7fffc47f0000 (LWP 22512)] llama_model_loader: loaded meta data with 24 key-value pairs and 112 tensors from /home/noe/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = nomic-bert llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5 llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12 llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048 llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768 llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072 llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12 llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000 llama_model_loader: - kv 8: general.file_type u32 = 1 llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1 llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000 llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2 llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101 llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102 llama_model_loader: - kv 15: tokenizer.ggml.model str = bert [New Thread 0x7fffc3fef000 (LWP 22513)] [New Thread 0x7fffc37ee000 (LWP 22514)] [New Thread 0x7fffc2fed000 (LWP 22515)] llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "... llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00... llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100 llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102 llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 22: tokenizer.ggml.cls_token_id u32 = 101 llama_model_loader: - kv 23: tokenizer.ggml.mask_token_id u32 = 103 llama_model_loader: - type f32: 51 tensors llama_model_loader: - type f16: 61 tensors llm_load_vocab: special tokens cache size = 5 llm_load_vocab: token to piece cache size = 0.2032 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = nomic-bert llm_load_print_meta: vocab type = WPM llm_load_print_meta: n_vocab = 30522 llm_load_print_meta: n_merges = 0 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 2048 llm_load_print_meta: n_embd = 768 llm_load_print_meta: n_layer = 12 llm_load_print_meta: n_head = 12 llm_load_print_meta: n_head_kv = 12 llm_load_print_meta: n_rot = 64 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 64 llm_load_print_meta: n_embd_head_v = 64 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 768 llm_load_print_meta: n_embd_v_gqa = 768 llm_load_print_meta: f_norm_eps = 1.0e-12 llm_load_print_meta: f_norm_rms_eps = 0.0e+00 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 3072 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 0 llm_load_print_meta: pooling type = 1 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 2048 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model 
type = 137M llm_load_print_meta: model ftype = F16 llm_load_print_meta: model params = 136.73 M llm_load_print_meta: model size = 260.86 MiB (16.00 BPW) llm_load_print_meta: general.name = nomic-embed-text-v1.5 llm_load_print_meta: BOS token = 101 '[CLS]' llm_load_print_meta: EOS token = 102 '[SEP]' llm_load_print_meta: UNK token = 100 '[UNK]' llm_load_print_meta: SEP token = 102 '[SEP]' llm_load_print_meta: PAD token = 0 '[PAD]' llm_load_print_meta: CLS token = 101 '[CLS]' llm_load_print_meta: MASK token = 103 '[MASK]' llm_load_print_meta: LF token = 0 '[PAD]' llm_load_print_meta: max token length = 21 [New Thread 0x7fffc27ec000 (LWP 22516)] ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes [New Thread 0x7fffc09de000 (LWP 22517)] [New Thread 0x7fffb5fff000 (LWP 22518)] llm_load_tensors: ggml ctx size = 0.10 MiB llm_load_tensors: offloading 12 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 13/13 layers to GPU llm_load_tensors: CPU buffer size = 44.72 MiB llm_load_tensors: CUDA0 buffer size = 216.15 MiB llama_new_context_with_model: n_ctx = 32768 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 1000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 1152.00 MiB llama_new_context_with_model: KV self size = 1152.00 MiB, K (f16): 576.00 MiB, V (f16): 576.00 MiB llama_new_context_with_model: CPU output buffer size = 0.00 MiB llama_new_context_with_model: CUDA0 compute buffer size = 22.01 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 2.51 MiB llama_new_context_with_model: graph nodes = 453 llama_new_context_with_model: graph splits = 2 [1724369443] warming up the model with an empty run /usr/include/c++/14.2.1/bits/stl_vector.h:1130: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = long unsigned int; _Alloc = std::allocator<long unsigned int>; reference = long unsigned int&; size_type = long unsigned int]: Assertion '__n < this->size()' failed. Thread 1 "ollama_llama_se" received signal SIGABRT, Aborted. 0x00007fffef4a7a7c in ?? () from /usr/lib/libc.so.6 #0 0x00007fffef4a7a7c in ?? () from /usr/lib/libc.so.6 #1 0x00007fffef452508 in raise () from /usr/lib/libc.so.6 #2 0x00007fffef43a4bb in abort () from /usr/lib/libc.so.6 #3 0x00007fffef6d3bb0 in std::__glibcxx_assert_fail (file=file@entry=0x5555558fd5a0 "/usr/include/c++/14.2.1/bits/stl_vector.h", line=line@entry=1130, function=function@entry=0x555555909a90 "std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = long unsigned int; _Alloc = std::allocator<long unsigned int>; reference = long unsigned int&; size_type"..., condition=condition@entry=0x5555558d78a3 "__n < this->size()") at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/assert_fail.cc:41 #4 0x00005555556d6149 in llama_set_inputs (lctx=..., batch=...) at /usr/include/c++/14.2.1/bits/stl_vector.h:1128 #5 0x000055555574687a in llama_decode_internal(llama_context&, llama_batch) [clone .isra.0] (lctx=..., batch_all=...) at /usr/src/debug/ollama/ollama-cuda/llm/llama.cpp/src/llama.cpp:14639 #6 0x00005555557093cc in llama_decode (ctx=0x55556f5d1c80, batch=...) 
at /usr/src/debug/ollama/ollama-cuda/llm/llama.cpp/src/llama.cpp:18352 #7 llama_init_from_gpt_params (params=...) at /usr/src/debug/ollama/ollama-cuda/llm/llama.cpp/common/common.cpp:2160 #8 0x00005555555d523d in llama_server_context::load_model (this=0x7fffffffc740, params_=...) at /usr/src/debug/ollama/ollama-cuda/llm/ext_server/server.cpp:406 #9 0x0000555555596a20 in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/ollama/ollama-cuda/llm/ext_server/server.cpp:3031 No function contains program counter for selected frame. rax 0x0 0 rbx 0x57dd 22493 rcx 0x7fffef4a7a7c 140737208023676 rdx 0x6 6 rsi 0x57dd 22493 rdi 0x57dd 22493 rbp 0x7ffff7f2c000 0x7ffff7f2c000 rsp 0x7fffffffad40 0x7fffffffad40 r8 0x0 0 r9 0xffffffee 4294967278 r10 0x8 8 r11 0x246 582 r12 0x10 16 r13 0x6 6 r14 0x2 2 r15 0x7fffffffb110 140737488335120 rip 0x7fffef4a7a7c 0x7fffef4a7a7c eflags 0x246 [ PF ZF IF ] cs 0x33 51 ss 0x2b 43 ds 0x0 0 es 0x0 0 fs 0x0 0 gs 0x0 0 fs_base 0x7ffff7f2c000 140737353269248 gs_base 0x0 0 zsh: command not found: zsh: zsh: command not found: --model: zsh: command not found: No zsh: command not found: Use zsh: command not found: No zsh: command not found: No zsh: command not found: The ✘ noe@noe-systemproductname  ~  which ollama /usr/bin/ollama ```
Author
Owner

@rick-github commented on GitHub (Aug 22, 2024):

It looks like `llama_set_inputs()` is doing something which causes an assert to fail inside the C++ standard library (libstdc++). Unfortunately, because you are using a distro-built version, the line numbers don't align with the source code in the repo. Could you try installing an official version (`curl -fsSL https://ollama.com/install.sh | sh`) and see if the problem persists? Note that you'll have to uninstall the current version and remove the /tmp/ollama* directories.
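
For what it's worth, the abort message itself points at the build rather than the machine: Arch's default packaging flags include `-D_GLIBCXX_ASSERTIONS`, which turns `std::vector::operator[]` into a bounds-checked access that aborts with exactly this kind of message. A build without that flag would hit the same out-of-bounds read as plain undefined behaviour, which plausibly shows up as the segmentation fault reported at the top of this issue. A minimal sketch of the difference, entirely outside ollama (the file name is made up for illustration):

```
# reproduce the libstdc++ bounds assertion in isolation (hypothetical demo)
cat > /tmp/oob_demo.cpp <<'EOF'
#include <vector>
int main() {
    std::vector<unsigned long> v(4);
    return (int)v[10];  // out-of-bounds read: silent UB normally,
                        // an immediate abort with _GLIBCXX_ASSERTIONS
}
EOF
g++ -D_GLIBCXX_ASSERTIONS /tmp/oob_demo.cpp -o /tmp/oob_demo
/tmp/oob_demo   # aborts: Assertion '__n < this->size()' failed.
```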

<!-- gh-comment-id:2305928390 --> @rick-github commented on GitHub (Aug 22, 2024): It looks like `llama_set_inputs()` is doing something which causes an assert fail inside the glibc library. Unfortunately because you are using a distro built version, the line numbers don't align with the source code in the repo. Could you try installing an official version (`curl -fsSL https://ollama.com/install.sh | sh`) and see if the problem persists? Note that you'll have to unistall the current version and remove /tmp/ollama* directories.
Author
Owner

@Milor123 commented on GitHub (Aug 23, 2024):

> It looks like `llama_set_inputs()` is doing something which causes an assert to fail inside the C++ standard library (libstdc++). Unfortunately, because you are using a distro-built version, the line numbers don't align with the source code in the repo. Could you try installing an official version (`curl -fsSL https://ollama.com/install.sh | sh`) and see if the problem persists? Note that you'll have to uninstall the current version and remove the /tmp/ollama* directories.

Ohh sure, I am using a glibc modded for gaming, maybe that is it. Thanks to you, I've tried running Ollama in a container:

```
sudo pacman -S nvidia-container-toolkit
```

and then

```
podman run -d --gpus=all -v $HOME/.ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

and now it works without problems.
Thank you very much, man.
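
For anyone landing here with the same crash: a quick sanity check that the containerized runner loads the model cleanly (a sketch, assuming the container name `ollama` from the command above) is to watch the container log while pulling the model:

```
podman logs -f ollama   # watch for a clean runner start instead of SIGABRT
podman exec -it ollama ollama pull snowflake-arctic-embed:22m
```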

<!-- gh-comment-id:2305970013 --> @Milor123 commented on GitHub (Aug 23, 2024): > It looks like `llama_set_inputs()` is doing something which causes an assert fail inside the glibc library. Unfortunately because you are using a distro built version, the line numbers don't align with the source code in the repo. Could you try installing an official version (`curl -fsSL https://ollama.com/install.sh | sh`) and see if the problem persists? Note that you'll have to unistall the current version and remove /tmp/ollama* directories. Ohh sure, I am using a glibc moded for gamming, maybe is it, thanks to you I've tried use ollama over docker ``` sudo pacman -S nvidia-container-toolkit ``` and then ``` podman run -d --gpus=all -v $HOME/.ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama` ``` and now works without problem Thank u very much man
Reference: github-starred/ollama#4056