[GH-ISSUE #7534] Performance Regression in Ollama 0.4.0 Compared to 0.3.14 #51302

Closed
opened 2026-04-28 19:20:03 -05:00 by GiteaMirror · 16 comments

Originally created by @MMaturax on GitHub (Nov 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7534

### What is the issue?

Hello,

After updating to Ollama version 0.4.0, which was noted to have performance improvements, I conducted some performance tests and observed that version 0.3.14 outperformed 0.4.0 in certain cases on my system.

Here are the specifics:

Ollama Version 0.4.0 Test Results (average speed of 70-78 tokens/second):

![0 4 0](https://github.com/user-attachments/assets/1459e08d-0386-4d48-b282-e08b8cb08877)

Ollama Version 0.3.14 Test Results (average speed of 82-88 tokens/second):

![0 3 14_2](https://github.com/user-attachments/assets/870d7d3a-f7bc-4620-afe4-e899adc22a83)

Could you provide insight into why version 0.4.0 is performing slower when an increase in speed was expected?

Thank you.
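
For reference, a minimal sketch of how a rate like this can be measured against a local server, assuming the standard /api/generate endpoint and jq; the model name and prompt are placeholders, and eval_duration is reported in nanoseconds:

```bash
# Repeat the same prompt and print the decode speed of each run.
for i in 1 2 3 4 5; do
  curl -s http://localhost:11434/api/generate \
       -d '{"model":"gemma2","prompt":"5+2/18=?","stream":false}' \
  | jq '.eval_count / (.eval_duration / 1e9)'   # tokens/second
done
```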

**System Information:**

OS: Ubuntu 24.04.1 LTS
CPU: AMD Ryzen 9 7950X3D 16-Core Processor
GPU: GeForce RTX 4070 Ti SUPER
Driver Version: 550.120
CUDA Version: 12.4

### OS

Linux

### GPU

Nvidia

### CPU

AMD

### Ollama version

0.4.0

GiteaMirror added the performance, bug, nvidia labels 2026-04-28 19:20:04 -05:00

@rick-github commented on GitHub (Nov 6, 2024):

## gemma2

| Version | Avg tps | 1SD |
|---------|---------|-----|
| 0.3.0 | 59.6673 | 0.112403 |
| 0.3.1 | 59.9044 | 0.100435 |
| 0.3.2 | 59.9028 | 0.0212802 |
| 0.3.3 | 59.8393 | 0.0985282 |
| 0.3.4 | 59.7704 | 0.0692319 |
| 0.3.5 | 59.8649 | 0.0551121 |
| 0.3.6 | 59.768 | 0.0821416 |
| 0.3.7 | 61.5798 | 0.0784618 |
| 0.3.8 | 61.5048 | 0.236138 |
| 0.3.9 | 61.505 | 0.150458 |
| 0.3.10 | 61.6102 | 0.0582519 |
| 0.3.11 | 61.6131 | 0.11853 |
| 0.3.12 | 61.5735 | 0.136105 |
| 0.3.13 | 61.5605 | 0.267206 |
| 0.3.14 | 62.3933 | 0.0306039 |
| 0.4.0 | 61.3178 | 0.479437 |

## llama3.1

| Version | Avg tps | 1SD |
|---------|---------|-----|
| 0.3.1 | 76.6885 | 0.0579047 |
| 0.3.2 | 76.7209 | 0.0636836 |
| 0.3.3 | 76.7105 | 0.0459862 |
| 0.3.4 | 76.5292 | 0.0506212 |
| 0.3.5 | 76.6054 | 0.0340236 |
| 0.3.6 | 76.6183 | 0.0427944 |
| 0.3.7 | 78.7352 | 0.144414 |
| 0.3.8 | 78.7267 | 0.13557 |
| 0.3.9 | 78.7356 | 0.1259 |
| 0.3.10 | 78.7975 | 0.116413 |
| 0.3.11 | 78.7261 | 0.107131 |
| 0.3.12 | 78.667 | 0.104676 |
| 0.3.13 | 78.7054 | 0.0939513 |
| 0.3.14 | 79.3558 | 0.157219 |
| 0.4.0 | 78.8116 | 0.251901 |

RTX 4070
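
To reduce per-run rates like these to an average and one standard deviation, a short pipeline suffices; a sketch assuming one tokens/s value per line in a file (rates.txt is a placeholder, e.g. output saved from a loop like the one earlier in this issue):

```bash
# Mean and sample standard deviation of tokens/s values, one per line.
awk '{ n++; s += $1; ss += $1 * $1 }
     END { m = s / n; printf "avg %.4f  1SD %.4f\n", m, sqrt((ss - n*m*m) / (n-1)) }' rates.txt
```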


@MMaturax commented on GitHub (Nov 6, 2024):

In my tests, there is an average difference of 10 tokens/s with gemma2:9b, and because of this, I had to revert to version 0.3.14.

I would appreciate it if you could share test results comparing versions 0.3.14 and 0.4.0 on different graphics cards for the same model.


@rick-github commented on GitHub (Nov 7, 2024):

## gemma2:9b-instruct-q4_0

| Version | Tesla T4 | σ | RTX 3080 L | σ | RTX 4070 | σ | A100 | σ |
|---------|----------|------|------------|------|----------|------|-------|------|
| 0.3.0 | 28.42 | 1.22 | 52.66 | 0.05 | 59.61 | 0.11 | 59.16 | 0.16 |
| 0.3.1 | 26.54 | 0.85 | 52.89 | 0.06 | 58.31 | 0.41 | 60.48 | 0.19 |
| 0.3.2 | 26.28 | 0.89 | 52.75 | 0.09 | 59.13 | 0.84 | 60.56 | 0.16 |
| 0.3.3 | 26.40 | 0.95 | 52.80 | 0.08 | 59.81 | 0.04 | 60.53 | 0.18 |
| 0.3.4 | 26.39 | 0.93 | 52.80 | 0.06 | 59.70 | 0.08 | 59.84 | 0.04 |
| 0.3.5 | 26.52 | 0.88 | 52.90 | 0.06 | 59.74 | 0.06 | 60.82 | 0.11 |
| 0.3.6 | 26.48 | 0.93 | 52.82 | 0.07 | 59.77 | 0.03 | 60.60 | 0.18 |
| 0.3.7 | 26.05 | 0.89 | 55.42 | 0.10 | 61.65 | 0.06 | 66.18 | 0.32 |
| 0.3.8 | 26.09 | 0.92 | 55.43 | 0.07 | 61.57 | 0.21 | 66.28 | 0.15 |
| 0.3.9 | 26.07 | 0.91 | 55.44 | 0.07 | 61.57 | 0.26 | 66.21 | 0.16 |
| 0.3.10 | 26.24 | 0.96 | 55.40 | 0.08 | 61.60 | 0.04 | 66.00 | 0.15 |
| 0.3.11 | 26.20 | 0.93 | 55.34 | 0.09 | 61.58 | 0.05 | 66.51 | 0.50 |
| 0.3.12 | 26.21 | 0.89 | 55.33 | 0.06 | 61.55 | 0.17 | 66.59 | 0.39 |
| 0.3.13 | 26.25 | 0.87 | 55.32 | 0.07 | 61.55 | 0.26 | 66.95 | 0.18 |
| 0.3.14 | 26.26 | 0.87 | 55.78 | 0.09 | 62.31 | 0.03 | 67.74 | 0.16 |
| 0.4.0 | 25.72 | 0.79 | 55.14 | 0.09 | 61.29 | 0.47 | 64.88 | 0.50 |

@Readon commented on GitHub (Nov 7, 2024):

Same here with 2x RTX 4090, running the Nemotron model:

0.4.0: 16.15 t/s
0.3.14: 18.85 t/s
vLLM: 34 t/s

I am also wondering why vLLM can run the model at 100% GPU utilization while Ollama cannot.
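
One way to check how busy the GPU actually is during generation (a sketch; nvidia-smi ships with the NVIDIA driver):

```bash
# Poll GPU utilization and memory use once per second while a prompt runs.
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```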


@endedasende commented on GitHub (Nov 8, 2024):

Experiencing the same regression with llama3.2:latest (3 GB) on an RTX 4070 12 GB:

0.3.14: 128 t/s
0.4.0: 74 t/s

The promise of better performance on NVIDIA 40xx cards was in the 0.4.0 pre-release change notes, but it was removed for the 0.4.0 release. We might yet see that improvement.


@rick-github commented on GitHub (Nov 8, 2024):

A performance difference on the order of tens of t/s is not expected, though. The most likely explanation is that more of the model is being run on the CPU. Can you post logs from both versions loading the same model? It might reveal why there's such a delta in performance.
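
For a systemd install, the relevant load lines can be pulled straight from the journal; a sketch (the grep pattern targets lines like the ones posted later in this issue):

```bash
# Show layer offload counts and how the llama runner was started.
journalctl -u ollama --no-pager \
  | grep -E 'offloaded|flash_attn|n_batch|starting llama server'
```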


@rick-github commented on GitHub (Nov 8, 2024):

RTX 4070 12G

gemma2: gemma2:9b-instruct-q4_0
llama3.1: llama3.1:8b-instruct-q4_K_M
llama3.2: llama3.2:3b-instruct-q4_K_M

| version | gemma2 | llama3.1 | llama3.2 |
|---------|--------|----------|----------|
| 0.3.1 | 59.80 ± 0.09 | 76.70 ± 0.03 | 139.46 ± 0.07 |
| 0.3.2 | 59.82 ± 0.08 | 76.71 ± 0.07 | 139.51 ± 0.24 |
| 0.3.3 | 59.87 ± 0.04 | 76.66 ± 0.08 | 139.41 ± 0.12 |
| 0.3.4 | 59.71 ± 0.06 | 76.59 ± 0.06 | 139.07 ± 0.18 |
| 0.3.5 | 59.74 ± 0.07 | 76.61 ± 0.05 | 139.27 ± 0.14 |
| 0.3.6 | 59.72 ± 0.05 | 76.62 ± 0.06 | 138.88 ± 0.19 |
| 0.3.7 | 61.63 ± 0.06 | 78.71 ± 0.09 | 144.90 ± 0.77 |
| 0.3.8 | 61.61 ± 0.05 | 78.74 ± 0.13 | 145.17 ± 0.18 |
| 0.3.9 | 61.53 ± 0.21 | 78.75 ± 0.05 | 144.91 ± 0.33 |
| 0.3.10 | 61.50 ± 0.18 | 78.72 ± 0.15 | 145.25 ± 0.09 |
| 0.3.11 | 61.53 ± 0.09 | 78.70 ± 0.15 | 145.34 ± 0.40 |
| 0.3.12 | 61.60 ± 0.04 | 78.70 ± 0.12 | 145.23 ± 0.11 |
| 0.3.13 | 61.49 ± 0.16 | 78.67 ± 0.13 | 145.20 ± 0.12 |
| 0.3.14 | 62.31 ± 0.30 | 79.34 ± 0.14 | 147.35 ± 0.52 |
| 0.4.0 | 61.50 ± 0.08 | 78.75 ± 0.26 | 141.62 ± 0.38 |

@MMaturax commented on GitHub (Nov 10, 2024):

I am sharing the detailed logs for you to review.


(base) user@materpc:~$ ollama --version
ollama version is 0.3.14

----------------

>>> 5+2/18=?
Here's how to solve the problem:

**Remember PEMDAS:**

* **P**arentheses/ **B**rackets
* **E**xponents/ **O**rders
* **M**ultiplication and **D**ivision (from left to right)
* **A**ddition and **S**ubtraction (from left to right)


1. **Division:** 2 ÷ 18 = 0.1111 (approximately)

2. **Addition:** 5 + 0.1111 = 5.1111 (approximately)




Therefore, 5 + 2/18 ≈ **5.1111**

total duration:       2.058563919s
load duration:        19.270895ms
prompt eval count:    295 token(s)
prompt eval duration: 31.971ms
prompt eval rate:     9227.11 tokens/s
eval count:           150 token(s)
eval duration:        1.757547s
eval rate:            85.35 tokens/s

----------------

(base) user@materpc:~$ ollama ps
NAME             ID              SIZE      PROCESSOR    UNTIL              
gemma2:latest    ff02c3702f32    9.4 GB    100% GPU     4 minutes from now  


----------------


Nov 10 22:44:34 materpc ollama[2103]: time=2024-11-10T22:44:34.829Z level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 gpu=GPU-a371085e-f395-5afa-80a4-9d858159f7d6 parallel=4 available=16627466240 required="8.8 GiB"
Nov 10 22:44:34 materpc ollama[2103]: time=2024-11-10T22:44:34.893Z level=INFO source=server.go:105 msg="system memory" total="62.4 GiB" free="60.0 GiB" free_swap="8.0 GiB"
Nov 10 22:44:34 materpc ollama[2103]: time=2024-11-10T22:44:34.893Z level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=43 layers.offload=43 layers.split="" memory.available="[15.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="8.8 GiB" memory.required.partial="8.8 GiB" memory.required.kv="2.6 GiB" memory.required.allocations="[8.8 GiB]" memory.weights.total="7.0 GiB" memory.weights.repeating="6.3 GiB" memory.weights.nonrepeating="717.8 MiB" memory.graph.full="507.0 MiB" memory.graph.partial="1.2 GiB"
Nov 10 22:44:34 materpc ollama[2103]: time=2024-11-10T22:44:34.894Z level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama1754208688/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 --ctx-size 8192 --batch-size 512 --embedding --n-gpu-layers 43 --threads 16 --flash-attn --parallel 4 --port 36341"
Nov 10 22:44:34 materpc ollama[2103]: time=2024-11-10T22:44:34.894Z level=INFO source=sched.go:449 msg="loaded runners" count=1
Nov 10 22:44:34 materpc ollama[2103]: time=2024-11-10T22:44:34.894Z level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
Nov 10 22:44:34 materpc ollama[2103]: time=2024-11-10T22:44:34.894Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
Nov 10 22:44:34 materpc ollama[129282]: INFO [main] starting c++ runner | tid="129353907965952" timestamp=1731278674
Nov 10 22:44:34 materpc ollama[129282]: INFO [main] build info | build=10 commit="3a8c75e" tid="129353907965952" timestamp=1731278674
Nov 10 22:44:34 materpc ollama[129282]: INFO [main] system info | n_threads=16 n_threads_batch=16 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="129353907965952" timestamp=1731278674 total_threads=32
Nov 10 22:44:34 materpc ollama[129282]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="36341" tid="129353907965952" timestamp=1731278674
Nov 10 22:44:34 materpc ollama[2103]: llama_model_loader: loaded meta data with 29 key-value pairs and 464 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 (version GGUF V3 (latest))
Nov 10 22:44:34 materpc ollama[2103]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Nov 10 22:44:34 materpc ollama[2103]: llama_model_loader: - kv   0:                       general.architecture str              = gemma2
Nov 10 22:44:34 materpc ollama[2103]: llama_model_loader: - kv   1:                               general.name str              = gemma-2-9b-it
Nov 10 22:44:34 materpc ollama[2103]: llama_model_loader: - kv   2:                      gemma2.context_length u32              = 8192
Nov 10 22:44:34 materpc ollama[2103]: llama_model_loader: - kv   3:                    gemma2.embedding_length u32              = 3584
Nov 10 22:44:34 materpc ollama[2103]: llama_model_loader: - kv   4:                         gemma2.block_count u32              = 42
Nov 10 22:44:34 materpc ollama[2103]: llama_model_loader: - kv   5:                 gemma2.feed_forward_length u32              = 14336
Nov 10 22:44:34 materpc ollama[2103]: llama_model_loader: - kv   6:                gemma2.attention.head_count u32              = 16
Nov 10 22:44:34 materpc ollama[2103]: llama_model_loader: - kv   7:             gemma2.attention.head_count_kv u32              = 8
Nov 10 22:44:34 materpc ollama[2103]: llama_model_loader: - kv   8:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
Nov 10 22:44:34 materpc ollama[2103]: llama_model_loader: - kv   9:                gemma2.attention.key_length u32              = 256
Nov 10 22:44:34 materpc ollama[2103]: llama_model_loader: - kv  10:              gemma2.attention.value_length u32              = 256
Nov 10 22:44:34 materpc ollama[2103]: llama_model_loader: - kv  11:                          general.file_type u32              = 2
Nov 10 22:44:34 materpc ollama[2103]: llama_model_loader: - kv  12:              gemma2.attn_logit_softcapping f32              = 50.000000
Nov 10 22:44:34 materpc ollama[2103]: llama_model_loader: - kv  13:             gemma2.final_logit_softcapping f32              = 30.000000
Nov 10 22:44:34 materpc ollama[2103]: llama_model_loader: - kv  14:            gemma2.attention.sliding_window u32              = 4096
Nov 10 22:44:34 materpc ollama[2103]: llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = llama
Nov 10 22:44:34 materpc ollama[2103]: llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = default
Nov 10 22:44:34 materpc ollama[2103]: llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
Nov 10 22:44:35 materpc ollama[2103]: llama_model_loader: - kv  18:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
Nov 10 22:44:35 materpc ollama[2103]: llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
Nov 10 22:44:35 materpc ollama[2103]: llama_model_loader: - kv  20:                tokenizer.ggml.bos_token_id u32              = 2
Nov 10 22:44:35 materpc ollama[2103]: llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 1
Nov 10 22:44:35 materpc ollama[2103]: llama_model_loader: - kv  22:            tokenizer.ggml.unknown_token_id u32              = 3
Nov 10 22:44:35 materpc ollama[2103]: llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 0
Nov 10 22:44:35 materpc ollama[2103]: llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = true
Nov 10 22:44:35 materpc ollama[2103]: llama_model_loader: - kv  25:               tokenizer.ggml.add_eos_token bool             = false
Nov 10 22:44:35 materpc ollama[2103]: llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
Nov 10 22:44:35 materpc ollama[2103]: llama_model_loader: - kv  27:            tokenizer.ggml.add_space_prefix bool             = false
Nov 10 22:44:35 materpc ollama[2103]: llama_model_loader: - kv  28:               general.quantization_version u32              = 2
Nov 10 22:44:35 materpc ollama[2103]: llama_model_loader: - type  f32:  169 tensors
Nov 10 22:44:35 materpc ollama[2103]: llama_model_loader: - type q4_0:  294 tensors
Nov 10 22:44:35 materpc ollama[2103]: llama_model_loader: - type q6_K:    1 tensors
Nov 10 22:44:35 materpc ollama[2103]: llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Nov 10 22:44:35 materpc ollama[2103]: llm_load_vocab: special tokens cache size = 108
Nov 10 22:44:35 materpc ollama[2103]: llm_load_vocab: token to piece cache size = 1.6014 MB
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: format           = GGUF V3 (latest)
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: arch             = gemma2
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: vocab type       = SPM
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: n_vocab          = 256000
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: n_merges         = 0
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: vocab_only       = 0
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: n_ctx_train      = 8192
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: n_embd           = 3584
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: n_layer          = 42
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: n_head           = 16
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: n_head_kv        = 8
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: n_rot            = 256
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: n_swa            = 4096
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: n_embd_head_k    = 256
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: n_embd_head_v    = 256
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: n_gqa            = 2
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: n_embd_k_gqa     = 2048
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: n_embd_v_gqa     = 2048
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: n_ff             = 14336
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: n_expert         = 0
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: n_expert_used    = 0
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: causal attn      = 1
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: pooling type     = 0
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: rope type        = 2
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: rope scaling     = linear
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: freq_base_train  = 10000.0
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: freq_scale_train = 1
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: n_ctx_orig_yarn  = 8192
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: rope_finetuned   = unknown
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: ssm_d_conv       = 0
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: ssm_d_inner      = 0
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: ssm_d_state      = 0
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: ssm_dt_rank      = 0
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: ssm_dt_b_c_rms   = 0
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: model type       = 9B
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: model ftype      = Q4_0
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: model params     = 9.24 B
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: model size       = 5.06 GiB (4.71 BPW)
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: general.name     = gemma-2-9b-it
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: BOS token        = 2 '<bos>'
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: EOS token        = 1 '<eos>'
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: UNK token        = 3 '<unk>'
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: PAD token        = 0 '<pad>'
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: LF token         = 227 '<0x0A>'
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: EOG token        = 1 '<eos>'
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: EOG token        = 107 '<end_of_turn>'
Nov 10 22:44:35 materpc ollama[2103]: llm_load_print_meta: max token length = 93
Nov 10 22:44:35 materpc ollama[2103]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Nov 10 22:44:35 materpc ollama[2103]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Nov 10 22:44:35 materpc ollama[2103]: ggml_cuda_init: found 1 CUDA devices:
Nov 10 22:44:35 materpc ollama[2103]:   Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
Nov 10 22:44:35 materpc ollama[2103]: llm_load_tensors: ggml ctx size =    0.41 MiB
Nov 10 22:44:35 materpc ollama[2103]: time=2024-11-10T22:44:35.146Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
Nov 10 22:44:40 materpc ollama[2103]: [GIN] 2024/11/10 - 22:44:40 | 200 |    5.540022ms |    192.168.1.55 | GET      "/api/tags"
Nov 10 22:44:43 materpc ollama[2103]: llm_load_tensors: offloading 42 repeating layers to GPU
Nov 10 22:44:43 materpc ollama[2103]: llm_load_tensors: offloading non-repeating layers to GPU
Nov 10 22:44:43 materpc ollama[2103]: llm_load_tensors: offloaded 43/43 layers to GPU
Nov 10 22:44:43 materpc ollama[2103]: llm_load_tensors:        CPU buffer size =   717.77 MiB
Nov 10 22:44:43 materpc ollama[2103]: llm_load_tensors:      CUDA0 buffer size =  5185.21 MiB
Nov 10 22:44:44 materpc ollama[2103]: llama_new_context_with_model: n_ctx      = 8192
Nov 10 22:44:44 materpc ollama[2103]: llama_new_context_with_model: n_batch    = 512
Nov 10 22:44:44 materpc ollama[2103]: llama_new_context_with_model: n_ubatch   = 512
Nov 10 22:44:44 materpc ollama[2103]: llama_new_context_with_model: flash_attn = 1
Nov 10 22:44:44 materpc ollama[2103]: llama_new_context_with_model: freq_base  = 10000.0
Nov 10 22:44:44 materpc ollama[2103]: llama_new_context_with_model: freq_scale = 1
Nov 10 22:44:44 materpc ollama[2103]: llama_kv_cache_init:      CUDA0 KV buffer size =  2688.00 MiB
Nov 10 22:44:44 materpc ollama[2103]: llama_new_context_with_model: KV self size  = 2688.00 MiB, K (f16): 1344.00 MiB, V (f16): 1344.00 MiB
Nov 10 22:44:44 materpc ollama[2103]: llama_new_context_with_model:  CUDA_Host  output buffer size =     3.96 MiB
Nov 10 22:44:44 materpc ollama[2103]: llama_new_context_with_model:      CUDA0 compute buffer size =   507.00 MiB
Nov 10 22:44:44 materpc ollama[2103]: llama_new_context_with_model:  CUDA_Host compute buffer size =    39.01 MiB
Nov 10 22:44:44 materpc ollama[2103]: llama_new_context_with_model: graph nodes  = 1398
Nov 10 22:44:44 materpc ollama[2103]: llama_new_context_with_model: graph splits = 2
Nov 10 22:44:44 materpc ollama[129282]: INFO [main] model loaded | tid="129353907965952" timestamp=1731278684
Nov 10 22:44:44 materpc ollama[2103]: time=2024-11-10T22:44:44.418Z level=INFO source=server.go:626 msg="llama runner started in 9.52 seconds"
Nov 10 22:44:44 materpc ollama[2103]: [GIN] 2024/11/10 - 22:44:44 | 200 |  9.718449438s |       127.0.0.1 | POST     "/api/generate"

(base) user@materpc:~$ ollama --version
ollama version is 0.4.1

-----------

>>> 5+2/18=?
Here's how to solve it:

**1. Division:** 2 divided by 18 is 2/18, which simplifies to 1/9.

**2. Addition:**  5 plus 1/9 is 5 + (1/9) = 46/9


**Answer:** 5 + 2/18 = 46/9 or approximately 5.11.

total duration:       1.172501385s
load duration:        24.592101ms
prompt eval count:    289 token(s)
prompt eval duration: 16ms
prompt eval rate:     18062.50 tokens/s
eval count:           91 token(s)
eval duration:        1.129s
eval rate:            80.60 tokens/s

------------

(base) user@materpc:~$ ollama ps
NAME             ID              SIZE      PROCESSOR    UNTIL              
gemma2:latest    ff02c3702f32    9.4 GB    100% GPU     4 minutes from now 

---------------

Nov 10 22:57:46 materpc ollama[130240]: time=2024-11-10T22:57:46.080Z level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 gpu=GPU-a371085e-f395-5afa-80a4-9d858159f7d6 parallel=4 available=16627466240 required="8.8 GiB"
Nov 10 22:57:46 materpc ollama[130240]: time=2024-11-10T22:57:46.139Z level=INFO source=server.go:105 msg="system memory" total="62.4 GiB" free="59.9 GiB" free_swap="8.0 GiB"
Nov 10 22:57:46 materpc ollama[130240]: time=2024-11-10T22:57:46.140Z level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=-1 layers.model=43 layers.offload=43 layers.split="" memory.available="[15.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="8.8 GiB" memory.required.partial="8.8 GiB" memory.required.kv="2.6 GiB" memory.required.allocations="[8.8 GiB]" memory.weights.total="7.0 GiB" memory.weights.repeating="6.3 GiB" memory.weights.nonrepeating="717.8 MiB" memory.graph.full="507.0 MiB" memory.graph.partial="1.2 GiB"
Nov 10 22:57:46 materpc ollama[130240]: time=2024-11-10T22:57:46.140Z level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama1666998563/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 --ctx-size 8192 --batch-size 512 --n-gpu-layers 43 --threads 16 --parallel 4 --port 42589"
Nov 10 22:57:46 materpc ollama[130240]: time=2024-11-10T22:57:46.141Z level=INFO source=sched.go:449 msg="loaded runners" count=1
Nov 10 22:57:46 materpc ollama[130240]: time=2024-11-10T22:57:46.141Z level=INFO source=server.go:562 msg="waiting for llama runner to start responding"
Nov 10 22:57:46 materpc ollama[130240]: time=2024-11-10T22:57:46.141Z level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server error"
Nov 10 22:57:46 materpc ollama[130240]: time=2024-11-10T22:57:46.167Z level=INFO source=runner.go:863 msg="starting go runner"
Nov 10 22:57:46 materpc ollama[130240]: time=2024-11-10T22:57:46.167Z level=INFO source=runner.go:864 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(gcc)" threads=16
Nov 10 22:57:46 materpc ollama[130240]: time=2024-11-10T22:57:46.167Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:42589"
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: loaded meta data with 29 key-value pairs and 464 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 (version GGUF V3 (latest))
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv   0:                       general.architecture str              = gemma2
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv   1:                               general.name str              = gemma-2-9b-it
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv   2:                      gemma2.context_length u32              = 8192
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv   3:                    gemma2.embedding_length u32              = 3584
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv   4:                         gemma2.block_count u32              = 42
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv   5:                 gemma2.feed_forward_length u32              = 14336
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv   6:                gemma2.attention.head_count u32              = 16
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv   7:             gemma2.attention.head_count_kv u32              = 8
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv   8:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv   9:                gemma2.attention.key_length u32              = 256
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv  10:              gemma2.attention.value_length u32              = 256
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv  11:                          general.file_type u32              = 2
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv  12:              gemma2.attn_logit_softcapping f32              = 50.000000
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv  13:             gemma2.final_logit_softcapping f32              = 30.000000
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv  14:            gemma2.attention.sliding_window u32              = 4096
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = llama
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = default
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv  18:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv  20:                tokenizer.ggml.bos_token_id u32              = 2
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 1
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv  22:            tokenizer.ggml.unknown_token_id u32              = 3
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 0
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = true
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv  25:               tokenizer.ggml.add_eos_token bool             = false
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv  27:            tokenizer.ggml.add_space_prefix bool             = false
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - kv  28:               general.quantization_version u32              = 2
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - type  f32:  169 tensors
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - type q4_0:  294 tensors
Nov 10 22:57:46 materpc ollama[130240]: llama_model_loader: - type q6_K:    1 tensors
Nov 10 22:57:46 materpc ollama[130240]: llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Nov 10 22:57:46 materpc ollama[130240]: llm_load_vocab: special tokens cache size = 108
Nov 10 22:57:46 materpc ollama[130240]: llm_load_vocab: token to piece cache size = 1.6014 MB
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: format           = GGUF V3 (latest)
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: arch             = gemma2
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: vocab type       = SPM
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: n_vocab          = 256000
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: n_merges         = 0
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: vocab_only       = 0
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: n_ctx_train      = 8192
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: n_embd           = 3584
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: n_layer          = 42
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: n_head           = 16
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: n_head_kv        = 8
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: n_rot            = 256
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: n_swa            = 4096
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: n_embd_head_k    = 256
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: n_embd_head_v    = 256
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: n_gqa            = 2
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: n_embd_k_gqa     = 2048
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: n_embd_v_gqa     = 2048
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: n_ff             = 14336
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: n_expert         = 0
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: n_expert_used    = 0
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: causal attn      = 1
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: pooling type     = 0
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: rope type        = 2
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: rope scaling     = linear
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: freq_base_train  = 10000.0
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: freq_scale_train = 1
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: n_ctx_orig_yarn  = 8192
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: rope_finetuned   = unknown
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: ssm_d_conv       = 0
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: ssm_d_inner      = 0
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: ssm_d_state      = 0
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: ssm_dt_rank      = 0
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: ssm_dt_b_c_rms   = 0
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: model type       = 9B
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: model ftype      = Q4_0
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: model params     = 9.24 B
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: model size       = 5.06 GiB (4.71 BPW)
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: general.name     = gemma-2-9b-it
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: BOS token        = 2 '<bos>'
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: EOS token        = 1 '<eos>'
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: UNK token        = 3 '<unk>'
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: PAD token        = 0 '<pad>'
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: LF token         = 227 '<0x0A>'
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: EOG token        = 1 '<eos>'
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: EOG token        = 107 '<end_of_turn>'
Nov 10 22:57:46 materpc ollama[130240]: llm_load_print_meta: max token length = 93
Nov 10 22:57:46 materpc ollama[130240]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Nov 10 22:57:46 materpc ollama[130240]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Nov 10 22:57:46 materpc ollama[130240]: ggml_cuda_init: found 1 CUDA devices:
Nov 10 22:57:46 materpc ollama[130240]:   Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
Nov 10 22:57:46 materpc ollama[130240]: time=2024-11-10T22:57:46.392Z level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server loading model"
Nov 10 22:57:46 materpc ollama[130240]: llm_load_tensors: ggml ctx size =    0.41 MiB
Nov 10 22:57:46 materpc ollama[130240]: llm_load_tensors: offloading 42 repeating layers to GPU
Nov 10 22:57:46 materpc ollama[130240]: llm_load_tensors: offloading non-repeating layers to GPU
Nov 10 22:57:46 materpc ollama[130240]: llm_load_tensors: offloaded 43/43 layers to GPU
Nov 10 22:57:46 materpc ollama[130240]: llm_load_tensors:        CPU buffer size =   717.77 MiB
Nov 10 22:57:46 materpc ollama[130240]: llm_load_tensors:      CUDA0 buffer size =  5185.21 MiB
Nov 10 22:57:46 materpc ollama[130240]: llama_new_context_with_model: n_ctx      = 8192
Nov 10 22:57:46 materpc ollama[130240]: llama_new_context_with_model: n_batch    = 2048
Nov 10 22:57:46 materpc ollama[130240]: llama_new_context_with_model: n_ubatch   = 512
Nov 10 22:57:46 materpc ollama[130240]: llama_new_context_with_model: flash_attn = 0
Nov 10 22:57:46 materpc ollama[130240]: llama_new_context_with_model: freq_base  = 10000.0
Nov 10 22:57:46 materpc ollama[130240]: llama_new_context_with_model: freq_scale = 1
Nov 10 22:57:46 materpc ollama[130240]: llama_kv_cache_init:      CUDA0 KV buffer size =  2688.00 MiB
Nov 10 22:57:46 materpc ollama[130240]: llama_new_context_with_model: KV self size  = 2688.00 MiB, K (f16): 1344.00 MiB, V (f16): 1344.00 MiB
Nov 10 22:57:46 materpc ollama[130240]: llama_new_context_with_model:  CUDA_Host  output buffer size =     3.96 MiB
Nov 10 22:57:47 materpc ollama[130240]: llama_new_context_with_model:      CUDA0 compute buffer size =   507.00 MiB
Nov 10 22:57:47 materpc ollama[130240]: llama_new_context_with_model:  CUDA_Host compute buffer size =    39.01 MiB
Nov 10 22:57:47 materpc ollama[130240]: llama_new_context_with_model: graph nodes  = 1690
Nov 10 22:57:47 materpc ollama[130240]: llama_new_context_with_model: graph splits = 2
Nov 10 22:57:47 materpc ollama[130240]: time=2024-11-10T22:57:47.145Z level=INFO source=server.go:601 msg="llama runner started in 1.00 seconds"
Nov 10 22:57:47 materpc ollama[130240]: [GIN] 2024/11/10 - 22:57:47 | 200 |  1.206791565s |       127.0.0.1 | POST     "/api/generate"
Nov 10 22:58:01 materpc ollama[130240]: [GIN] 2024/11/10 - 22:58:01 | 200 |   1.94717142s |       127.0.0.1 | POST     "/api/chat"
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: loaded meta data with 29 key-value pairs and 464 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 (version GGUF V3 (latest))
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv   0:                       general.architecture str              = gemma2
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv   1:                               general.name str              = gemma-2-9b-it
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv   2:                      gemma2.context_length u32              = 8192
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv   3:                    gemma2.embedding_length u32              = 3584
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv   4:                         gemma2.block_count u32              = 42
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv   5:                 gemma2.feed_forward_length u32              = 14336
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv   6:                gemma2.attention.head_count u32              = 16
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv   7:             gemma2.attention.head_count_kv u32              = 8
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv   8:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv   9:                gemma2.attention.key_length u32              = 256
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv  10:              gemma2.attention.value_length u32              = 256
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv  11:                          general.file_type u32              = 2
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv  12:              gemma2.attn_logit_softcapping f32              = 50.000000
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv  13:             gemma2.final_logit_softcapping f32              = 30.000000
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv  14:            gemma2.attention.sliding_window u32              = 4096
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = llama
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = default
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv  18:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv  20:                tokenizer.ggml.bos_token_id u32              = 2
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 1
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv  22:            tokenizer.ggml.unknown_token_id u32              = 3
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 0
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = true
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv  25:               tokenizer.ggml.add_eos_token bool             = false
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv  27:            tokenizer.ggml.add_space_prefix bool             = false
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv  28:               general.quantization_version u32              = 2
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - type  f32:  169 tensors
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - type q4_0:  294 tensors
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - type q6_K:    1 tensors
Nov 10 22:58:07 materpc ollama[130240]: llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Nov 10 22:58:07 materpc ollama[130240]: llm_load_vocab: special tokens cache size = 108
Nov 10 22:58:07 materpc ollama[130240]: llm_load_vocab: token to piece cache size = 1.6014 MB
Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: format           = GGUF V3 (latest)
Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: arch             = gemma2
Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: vocab type       = SPM
Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: n_vocab          = 256000
Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: n_merges         = 0
Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: vocab_only       = 1
Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: model type       = ?B
Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: model ftype      = all F32
Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: model params     = 9.24 B
Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: model size       = 5.06 GiB (4.71 BPW)
Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: general.name     = gemma-2-9b-it
Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: BOS token        = 2 '<bos>'
Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: EOS token        = 1 '<eos>'
Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: UNK token        = 3 '<unk>'
Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: PAD token        = 0 '<pad>'
Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: LF token         = 227 '<0x0A>'
Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: EOG token        = 1 '<eos>'
Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: EOG token        = 107 '<end_of_turn>'
Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: max token length = 93
Nov 10 22:58:07 materpc ollama[130240]: llama_model_load: vocab only - skipping tensors
Nov 10 22:58:08 materpc ollama[130240]: [GIN] 2024/11/10 - 22:58:08 | 200 |  1.325050573s |       127.0.0.1 | POST     "/api/chat"
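
One difference visible in the two load logs above: the 0.3.14 runner was launched with --flash-attn and reports flash_attn = 1, while the 0.4.1 runner reports flash_attn = 0 (and n_batch = 2048 rather than 512). A minimal sketch for re-enabling flash attention to compare, assuming the documented OLLAMA_FLASH_ATTENTION variable and a systemd-managed install:

```bash
# Add a service override enabling flash attention, then restart.
sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
sudo systemctl restart ollama
```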

0.4.1 DEBUG ENABLED
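
The debug output below would typically be enabled with the documented OLLAMA_DEBUG switch; a minimal sketch of running the server with it:

```bash
# Verbose server-side logging (source of the dlsym/cuInit lines below).
OLLAMA_DEBUG=1 ollama serve
```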

(base) user@materpc:~$ ollama run gemma2:latest --verbose
>>> 5+2/18=?
Here's how to solve the problem:

**Order of Operations**

Remember to follow the order of operations (PEMDAS/BODMAS):

* **P**arentheses / **B**rackets
* **E**xponents / **O**rders
* **M**ultiplication and **D**ivision (from left to right)
* **A**ddition and **S**ubtraction (from left to right)

**Calculation**

1. **Division:** 2/18 = 1/9 
2. **Addition:** 5 + 1/9 = 46/9


**Answer**

5 + 2/18 = 46/9  (or approximately 5.11)

total duration:       2.306254951s
load duration:        28.086045ms
prompt eval count:    16 token(s)
prompt eval duration: 12ms
prompt eval rate:     1333.33 tokens/s
eval count:           163 token(s)
eval duration:        2.265s
eval rate:            71.96 tokens/s

-----------------

---------

(base) user@materpc:~$ journalctl -u ollama --no-pager -n300
Nov 10 23:08:32 materpc ollama[131600]: dlsym: cuMemGetInfo_v2 - 0x77d44c886e20
Nov 10 23:08:32 materpc ollama[131600]: dlsym: cuCtxDestroy - 0x77d44c8e1850
Nov 10 23:08:32 materpc ollama[131600]: calling cuInit
Nov 10 23:08:32 materpc ollama[131600]: calling cuDriverGetVersion
Nov 10 23:08:32 materpc ollama[131600]: raw version 0x2f08
Nov 10 23:08:32 materpc ollama[131600]: CUDA driver version: 12.4
Nov 10 23:08:32 materpc ollama[131600]: calling cuDeviceGetCount
Nov 10 23:08:32 materpc ollama[131600]: device count 1
Nov 10 23:08:32 materpc ollama[131600]: time=2024-11-10T23:08:32.858Z level=DEBUG source=gpu.go:129 msg="detected GPUs" count=1 library=/usr/lib/x86_64-linux-gnu/libcuda.so.550.120
Nov 10 23:08:32 materpc ollama[131600]: [GPU-a371085e-f395-5afa-80a4-9d858159f7d6] CUDA totalMem 16071 mb
Nov 10 23:08:32 materpc ollama[131600]: [GPU-a371085e-f395-5afa-80a4-9d858159f7d6] CUDA freeMem 15857 mb
Nov 10 23:08:32 materpc ollama[131600]: [GPU-a371085e-f395-5afa-80a4-9d858159f7d6] Compute Capability 8.9
Nov 10 23:08:33 materpc ollama[131600]: time=2024-11-10T23:08:33.016Z level=DEBUG source=amd_linux.go:416 msg="amdgpu driver not detected /sys/module/amdgpu"
Nov 10 23:08:33 materpc ollama[131600]: releasing cuda driver library
Nov 10 23:08:33 materpc ollama[131600]: time=2024-11-10T23:08:33.016Z level=INFO source=types.go:123 msg="inference compute" id=GPU-a371085e-f395-5afa-80a4-9d858159f7d6 library=cuda variant=v12 compute=8.9 driver=12.4 name="NVIDIA GeForce RTX 4070 Ti SUPER" total="15.7 GiB" available="15.5 GiB"
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.429Z level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="62.4 GiB" before.free="60.2 GiB" before.free_swap="8.0 GiB" now.total="62.4 GiB" now.free="60.1 GiB" now.free_swap="8.0 GiB"
Nov 10 23:08:39 materpc ollama[131600]: initializing /usr/lib/x86_64-linux-gnu/libcuda.so.550.120
Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuInit - 0x77d44c87cbc0
Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDriverGetVersion - 0x77d44c87cbe0
Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDeviceGetCount - 0x77d44c87cc20
Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDeviceGet - 0x77d44c87cc00
Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDeviceGetAttribute - 0x77d44c87cd00
Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDeviceGetUuid - 0x77d44c87cc60
Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDeviceGetName - 0x77d44c87cc40
Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuCtxCreate_v3 - 0x77d44c87cee0
Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuMemGetInfo_v2 - 0x77d44c886e20
Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuCtxDestroy - 0x77d44c8e1850
Nov 10 23:08:39 materpc ollama[131600]: calling cuInit
Nov 10 23:08:39 materpc ollama[131600]: calling cuDriverGetVersion
Nov 10 23:08:39 materpc ollama[131600]: raw version 0x2f08
Nov 10 23:08:39 materpc ollama[131600]: CUDA driver version: 12.4
Nov 10 23:08:39 materpc ollama[131600]: calling cuDeviceGetCount
Nov 10 23:08:39 materpc ollama[131600]: device count 1
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.508Z level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a371085e-f395-5afa-80a4-9d858159f7d6 name="NVIDIA GeForce RTX 4070 Ti SUPER" overhead="0 B" before.total="15.7 GiB" before.free="15.5 GiB" now.total="15.7 GiB" now.free="15.5 GiB" now.used="214.6 MiB"
Nov 10 23:08:39 materpc ollama[131600]: releasing cuda driver library
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.508Z level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x811140 gpu_count=1
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.547Z level=DEBUG source=sched.go:224 msg="loading first model" model=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.547Z level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[15.5 GiB]"
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.548Z level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 gpu=GPU-a371085e-f395-5afa-80a4-9d858159f7d6 parallel=4 available=16627466240 required="8.8 GiB"
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.548Z level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="62.4 GiB" before.free="60.1 GiB" before.free_swap="8.0 GiB" now.total="62.4 GiB" now.free="60.1 GiB" now.free_swap="8.0 GiB"
Nov 10 23:08:39 materpc ollama[131600]: initializing /usr/lib/x86_64-linux-gnu/libcuda.so.550.120
Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuInit - 0x77d44c87cbc0
Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDriverGetVersion - 0x77d44c87cbe0
Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDeviceGetCount - 0x77d44c87cc20
Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDeviceGet - 0x77d44c87cc00
Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDeviceGetAttribute - 0x77d44c87cd00
Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDeviceGetUuid - 0x77d44c87cc60
Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDeviceGetName - 0x77d44c87cc40
Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuCtxCreate_v3 - 0x77d44c87cee0
Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuMemGetInfo_v2 - 0x77d44c886e20
Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuCtxDestroy - 0x77d44c8e1850
Nov 10 23:08:39 materpc ollama[131600]: calling cuInit
Nov 10 23:08:39 materpc ollama[131600]: calling cuDriverGetVersion
Nov 10 23:08:39 materpc ollama[131600]: raw version 0x2f08
Nov 10 23:08:39 materpc ollama[131600]: CUDA driver version: 12.4
Nov 10 23:08:39 materpc ollama[131600]: calling cuDeviceGetCount
Nov 10 23:08:39 materpc ollama[131600]: device count 1
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.610Z level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a371085e-f395-5afa-80a4-9d858159f7d6 name="NVIDIA GeForce RTX 4070 Ti SUPER" overhead="0 B" before.total="15.7 GiB" before.free="15.5 GiB" now.total="15.7 GiB" now.free="15.5 GiB" now.used="214.6 MiB"
Nov 10 23:08:39 materpc ollama[131600]: releasing cuda driver library
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=INFO source=server.go:105 msg="system memory" total="62.4 GiB" free="60.1 GiB" free_swap="8.0 GiB"
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[15.5 GiB]"
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=-1 layers.model=43 layers.offload=43 layers.split="" memory.available="[15.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="8.8 GiB" memory.required.partial="8.8 GiB" memory.required.kv="2.6 GiB" memory.required.allocations="[8.8 GiB]" memory.weights.total="7.0 GiB" memory.weights.repeating="6.3 GiB" memory.weights.nonrepeating="717.8 MiB" memory.graph.full="507.0 MiB" memory.graph.partial="1.2 GiB"
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=common.go:168 msg=extracting runner=cpu payload=linux/amd64/cpu/ollama_llama_server.gz
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=common.go:168 msg=extracting runner=cpu_avx payload=linux/amd64/cpu_avx/ollama_llama_server.gz
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=common.go:168 msg=extracting runner=cpu_avx2 payload=linux/amd64/cpu_avx2/ollama_llama_server.gz
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=common.go:168 msg=extracting runner=cuda_v11 payload=linux/amd64/cuda_v11/ollama_llama_server.gz
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=common.go:168 msg=extracting runner=cuda_v12 payload=linux/amd64/cuda_v12/ollama_llama_server.gz
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=common.go:168 msg=extracting runner=rocm payload=linux/amd64/rocm/ollama_llama_server.gz
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/cpu/ollama_llama_server
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/cpu_avx/ollama_llama_server
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/cpu_avx2/ollama_llama_server
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/cuda_v11/ollama_llama_server
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/cuda_v12/ollama_llama_server
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/rocm/ollama_llama_server
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/cpu/ollama_llama_server
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/cpu_avx/ollama_llama_server
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/cpu_avx2/ollama_llama_server
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/cuda_v11/ollama_llama_server
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/cuda_v12/ollama_llama_server
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/rocm/ollama_llama_server
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama3132615929/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 --ctx-size 8192 --batch-size 512 --n-gpu-layers 43 --verbose --threads 16 --flash-attn --parallel 4 --port 35173"
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=DEBUG source=server.go:400 msg=subprocess environment="[PATH=/home/user/.local/bin:/home/user/miniconda3/bin:/home/user/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/user/.local/bin:/home/user/.local/bin LD_LIBRARY_PATH=/usr/local/lib/ollama:/tmp/ollama3132615929/runners/cuda_v12 CUDA_VISIBLE_DEVICES=GPU-a371085e-f395-5afa-80a4-9d858159f7d6]"
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=INFO source=sched.go:449 msg="loaded runners" count=1
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=INFO source=server.go:562 msg="waiting for llama runner to start responding"
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server error"
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.638Z level=INFO source=runner.go:863 msg="starting go runner"
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.639Z level=INFO source=runner.go:864 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(gcc)" threads=16
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.639Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:35173"
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: loaded meta data with 29 key-value pairs and 464 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 (version GGUF V3 (latest))
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv   0:                       general.architecture str              = gemma2
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv   1:                               general.name str              = gemma-2-9b-it
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv   2:                      gemma2.context_length u32              = 8192
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv   3:                    gemma2.embedding_length u32              = 3584
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv   4:                         gemma2.block_count u32              = 42
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv   5:                 gemma2.feed_forward_length u32              = 14336
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv   6:                gemma2.attention.head_count u32              = 16
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv   7:             gemma2.attention.head_count_kv u32              = 8
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv   8:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv   9:                gemma2.attention.key_length u32              = 256
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv  10:              gemma2.attention.value_length u32              = 256
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv  11:                          general.file_type u32              = 2
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv  12:              gemma2.attn_logit_softcapping f32              = 50.000000
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv  13:             gemma2.final_logit_softcapping f32              = 30.000000
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv  14:            gemma2.attention.sliding_window u32              = 4096
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = llama
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = default
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv  18:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv  20:                tokenizer.ggml.bos_token_id u32              = 2
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 1
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv  22:            tokenizer.ggml.unknown_token_id u32              = 3
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 0
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = true
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv  25:               tokenizer.ggml.add_eos_token bool             = false
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv  27:            tokenizer.ggml.add_space_prefix bool             = false
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv  28:               general.quantization_version u32              = 2
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - type  f32:  169 tensors
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - type q4_0:  294 tensors
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - type q6_K:    1 tensors
Nov 10 23:08:39 materpc ollama[131600]: llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Nov 10 23:08:39 materpc ollama[131600]: llm_load_vocab: special tokens cache size = 108
Nov 10 23:08:39 materpc ollama[131600]: llm_load_vocab: token to piece cache size = 1.6014 MB
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: format           = GGUF V3 (latest)
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: arch             = gemma2
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: vocab type       = SPM
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_vocab          = 256000
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_merges         = 0
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: vocab_only       = 0
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_ctx_train      = 8192
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_embd           = 3584
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_layer          = 42
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_head           = 16
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_head_kv        = 8
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_rot            = 256
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_swa            = 4096
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_embd_head_k    = 256
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_embd_head_v    = 256
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_gqa            = 2
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_embd_k_gqa     = 2048
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_embd_v_gqa     = 2048
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_ff             = 14336
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_expert         = 0
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_expert_used    = 0
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: causal attn      = 1
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: pooling type     = 0
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: rope type        = 2
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: rope scaling     = linear
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: freq_base_train  = 10000.0
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: freq_scale_train = 1
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_ctx_orig_yarn  = 8192
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: rope_finetuned   = unknown
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: ssm_d_conv       = 0
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: ssm_d_inner      = 0
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: ssm_d_state      = 0
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: ssm_dt_rank      = 0
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: ssm_dt_b_c_rms   = 0
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: model type       = 9B
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: model ftype      = Q4_0
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: model params     = 9.24 B
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: model size       = 5.06 GiB (4.71 BPW)
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: general.name     = gemma-2-9b-it
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: BOS token        = 2 '<bos>'
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: EOS token        = 1 '<eos>'
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: UNK token        = 3 '<unk>'
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: PAD token        = 0 '<pad>'
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: LF token         = 227 '<0x0A>'
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: EOG token        = 1 '<eos>'
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: EOG token        = 107 '<end_of_turn>'
Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: max token length = 93
Nov 10 23:08:39 materpc ollama[131600]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Nov 10 23:08:39 materpc ollama[131600]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Nov 10 23:08:39 materpc ollama[131600]: ggml_cuda_init: found 1 CUDA devices:
Nov 10 23:08:39 materpc ollama[131600]:   Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.863Z level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server loading model"
Nov 10 23:08:39 materpc ollama[131600]: llm_load_tensors: ggml ctx size =    0.41 MiB
Nov 10 23:08:40 materpc ollama[131600]: llm_load_tensors: offloading 42 repeating layers to GPU
Nov 10 23:08:40 materpc ollama[131600]: llm_load_tensors: offloading non-repeating layers to GPU
Nov 10 23:08:40 materpc ollama[131600]: llm_load_tensors: offloaded 43/43 layers to GPU
Nov 10 23:08:40 materpc ollama[131600]: llm_load_tensors:        CPU buffer size =   717.77 MiB
Nov 10 23:08:40 materpc ollama[131600]: llm_load_tensors:      CUDA0 buffer size =  5185.21 MiB
Nov 10 23:08:40 materpc ollama[131600]: time=2024-11-10T23:08:40.114Z level=DEBUG source=server.go:607 msg="model load progress 0.12"
Nov 10 23:08:40 materpc ollama[131600]: time=2024-11-10T23:08:40.365Z level=DEBUG source=server.go:607 msg="model load progress 1.00"
Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: n_ctx      = 8192
Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: n_batch    = 2048
Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: n_ubatch   = 512
Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: flash_attn = 1
Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: freq_base  = 10000.0
Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: freq_scale = 1
Nov 10 23:08:40 materpc ollama[131600]: llama_kv_cache_init:      CUDA0 KV buffer size =  2688.00 MiB
Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: KV self size  = 2688.00 MiB, K (f16): 1344.00 MiB, V (f16): 1344.00 MiB
Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model:  CUDA_Host  output buffer size =     3.96 MiB
Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model:      CUDA0 compute buffer size =   507.00 MiB
Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model:  CUDA_Host compute buffer size =    39.01 MiB
Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: graph nodes  = 1398
Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: graph splits = 2
Nov 10 23:08:40 materpc ollama[131600]: time=2024-11-10T23:08:40.615Z level=INFO source=server.go:601 msg="llama runner started in 1.00 seconds"
Nov 10 23:08:40 materpc ollama[131600]: time=2024-11-10T23:08:40.615Z level=DEBUG source=sched.go:462 msg="finished setting up runner" model=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
Nov 10 23:08:40 materpc ollama[131600]: time=2024-11-10T23:08:40.616Z level=DEBUG source=server.go:955 msg="new runner detected, loading model for cgo tokenization"
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: loaded meta data with 29 key-value pairs and 464 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 (version GGUF V3 (latest))
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv   0:                       general.architecture str              = gemma2
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv   1:                               general.name str              = gemma-2-9b-it
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv   2:                      gemma2.context_length u32              = 8192
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv   3:                    gemma2.embedding_length u32              = 3584
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv   4:                         gemma2.block_count u32              = 42
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv   5:                 gemma2.feed_forward_length u32              = 14336
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv   6:                gemma2.attention.head_count u32              = 16
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv   7:             gemma2.attention.head_count_kv u32              = 8
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv   8:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv   9:                gemma2.attention.key_length u32              = 256
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv  10:              gemma2.attention.value_length u32              = 256
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv  11:                          general.file_type u32              = 2
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv  12:              gemma2.attn_logit_softcapping f32              = 50.000000
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv  13:             gemma2.final_logit_softcapping f32              = 30.000000
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv  14:            gemma2.attention.sliding_window u32              = 4096
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = llama
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = default
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv  18:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv  20:                tokenizer.ggml.bos_token_id u32              = 2
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 1
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv  22:            tokenizer.ggml.unknown_token_id u32              = 3
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 0
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = true
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv  25:               tokenizer.ggml.add_eos_token bool             = false
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv  27:            tokenizer.ggml.add_space_prefix bool             = false
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv  28:               general.quantization_version u32              = 2
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - type  f32:  169 tensors
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - type q4_0:  294 tensors
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - type q6_K:    1 tensors
Nov 10 23:08:40 materpc ollama[131600]: llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Nov 10 23:08:40 materpc ollama[131600]: llm_load_vocab: special tokens cache size = 108
Nov 10 23:08:40 materpc ollama[131600]: llm_load_vocab: token to piece cache size = 1.6014 MB
Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: format           = GGUF V3 (latest)
Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: arch             = gemma2
Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: vocab type       = SPM
Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: n_vocab          = 256000
Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: n_merges         = 0
Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: vocab_only       = 1
Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: model type       = ?B
Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: model ftype      = all F32
Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: model params     = 9.24 B
Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: model size       = 5.06 GiB (4.71 BPW)
Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: general.name     = gemma-2-9b-it
Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: BOS token        = 2 '<bos>'
Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: EOS token        = 1 '<eos>'
Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: UNK token        = 3 '<unk>'
Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: PAD token        = 0 '<pad>'
Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: LF token         = 227 '<0x0A>'
Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: EOG token        = 1 '<eos>'
Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: EOG token        = 107 '<end_of_turn>'
Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: max token length = 93
Nov 10 23:08:40 materpc ollama[131600]: llama_model_load: vocab only - skipping tensors
Nov 10 23:08:40 materpc ollama[131600]: time=2024-11-10T23:08:40.812Z level=DEBUG source=routes.go:1457 msg="chat request" images=0 prompt="<start_of_turn>user\n5+2/18=?<end_of_turn>\n<start_of_turn>model\nHere's how to solve the problem:\n\n**Remember the order of operations (PEMDAS/BODMAS):**\n\n* **P**arentheses / **B**rackets\n* **E**xponents / **O**rders\n* **M**ultiplication and **D**ivision (from left to right)\n* **A**ddition and **S**ubtraction (from left to right)\n\n1. **Division:** 2/18 = 0.1111 (approximately)\n2. **Addition:** 5 + 0.1111 = 5.1111 (approximately)\n\n\nTherefore, 5 + 2/18 ≈ **5.1111**<end_of_turn>\n<start_of_turn>user\n5+2/18=?<end_of_turn>\n<start_of_turn>model\nHere's how to solve it:\n\n**Remember PEMDAS:**  Parentheses, Exponents, Multiplication and Division (left to right), Addition and Subtraction (left to right).\n\n1. **Division:** 2 / 18 = 0.1111 (approximately)\n2. **Addition:** 5 + 0.1111 = 5.1111 (approximately)\n\n\nSo,  5 + 2/18 ≈ **5.1111**<end_of_turn>\n<start_of_turn>user\n5+2/18=?<end_of_turn>\n<start_of_turn>model\nHere's how to solve it:\n\n**Remember the order of operations (PEMDAS):**\n\n* **P**arentheses / **B**rackets\n* **E**xponents / **O**rders\n* **M**ultiplication and **D**ivision (from left to right)\n* **A**ddition and **S**ubtraction (from left to right)\n\n1. **Division:** 2 divided by 18 is 0.1111 (approximately)\n2. **Addition:**  5 plus 0.1111 is 5.1111 (approximately)\n\n\nTherefore, 5 + 2/18 ≈ **5.1111**<end_of_turn>\n<start_of_turn>user\n5+2/18=?<end_of_turn>\n<start_of_turn>model\nHere's how to solve it:\n\n**1. Division:** 2 divided by 18 is 0.1111 (approximately)\n\n**2. Addition:**  5 plus 0.1111 is 5.1111 (approximately)\n\n\nTherefore, 5 + 2/18 ≈ **5.1111**<end_of_turn>\n<start_of_turn>user\n5+2/18=?<end_of_turn>\n<start_of_turn>model\n"
Nov 10 23:08:40 materpc ollama[131600]: time=2024-11-10T23:08:40.813Z level=DEBUG source=cache.go:99 msg="loading cache slot" id=0 cache=0 prompt=582 used=0 remaining=582
Nov 10 23:08:42 materpc ollama[131600]: [GIN] 2024/11/10 - 23:08:42 | 200 |    5.703167ms |    192.168.1.55 | GET      "/api/tags"
Nov 10 23:08:43 materpc ollama[131600]: [GIN] 2024/11/10 - 23:08:43 | 200 |  3.817021475s |       127.0.0.1 | POST     "/api/chat"
Nov 10 23:08:43 materpc ollama[131600]: time=2024-11-10T23:08:43.221Z level=DEBUG source=sched.go:466 msg="context for request finished"
Nov 10 23:08:43 materpc ollama[131600]: time=2024-11-10T23:08:43.221Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 duration=5m0s
Nov 10 23:08:43 materpc ollama[131600]: time=2024-11-10T23:08:43.221Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 refCount=0
Nov 10 23:08:43 materpc ollama[131600]: [GIN] 2024/11/10 - 23:08:43 | 200 |       16.59µs |       127.0.0.1 | HEAD     "/"
Nov 10 23:08:43 materpc ollama[131600]: [GIN] 2024/11/10 - 23:08:43 | 200 |      64.469µs |       127.0.0.1 | GET      "/api/ps"
Nov 10 23:08:58 materpc ollama[131600]: [GIN] 2024/11/10 - 23:08:58 | 200 |    5.268749ms |    192.168.1.55 | GET      "/api/tags"
Nov 10 23:08:59 materpc ollama[131600]: [GIN] 2024/11/10 - 23:08:59 | 200 |       18.65µs |       127.0.0.1 | HEAD     "/"
Nov 10 23:08:59 materpc ollama[131600]: [GIN] 2024/11/10 - 23:08:59 | 200 |   15.703358ms |       127.0.0.1 | POST     "/api/show"
Nov 10 23:08:59 materpc ollama[131600]: time=2024-11-10T23:08:59.309Z level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
Nov 10 23:08:59 materpc ollama[131600]: [GIN] 2024/11/10 - 23:08:59 | 200 |   20.230051ms |       127.0.0.1 | POST     "/api/generate"
Nov 10 23:08:59 materpc ollama[131600]: time=2024-11-10T23:08:59.310Z level=DEBUG source=sched.go:407 msg="context for request finished"
Nov 10 23:08:59 materpc ollama[131600]: time=2024-11-10T23:08:59.310Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 duration=5m0s
Nov 10 23:08:59 materpc ollama[131600]: time=2024-11-10T23:08:59.310Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 refCount=0
Nov 10 23:09:00 materpc ollama[131600]: time=2024-11-10T23:09:00.727Z level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
Nov 10 23:09:00 materpc ollama[131600]: time=2024-11-10T23:09:00.727Z level=DEBUG source=routes.go:1457 msg="chat request" images=0 prompt="<start_of_turn>user\n5+2/18=?<end_of_turn>\n<start_of_turn>model\n"
Nov 10 23:09:00 materpc ollama[131600]: time=2024-11-10T23:09:00.727Z level=DEBUG source=cache.go:99 msg="loading cache slot" id=0 cache=737 prompt=16 used=15 remaining=1
Nov 10 23:09:03 materpc ollama[131600]: [GIN] 2024/11/10 - 23:09:03 | 200 |  2.306287181s |       127.0.0.1 | POST     "/api/chat"
Nov 10 23:09:03 materpc ollama[131600]: time=2024-11-10T23:09:03.005Z level=DEBUG source=sched.go:407 msg="context for request finished"
Nov 10 23:09:03 materpc ollama[131600]: time=2024-11-10T23:09:03.005Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 duration=5m0s
Nov 10 23:09:03 materpc ollama[131600]: time=2024-11-10T23:09:03.005Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 refCount=0
Nov 10 23:09:14 materpc ollama[131600]: [GIN] 2024/11/10 - 23:09:14 | 200 |    4.720042ms |    192.168.1.55 | GET      "/api/tags"
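The DEBUG-level journal above is the kind of output Ollama emits with `OLLAMA_DEBUG=1` set, and the tokens/s figures quoted in this thread come from `ollama run --verbose`. For reference, a minimal sketch of how an equivalent trace can be captured, assuming the default systemd-managed Linux install (service name `ollama`, matching the journalctl lines above):

```
# Assumption: default Linux install where Ollama runs under systemd as the
# "ollama" service, as in the log lines above.

# Enable debug logging for the server:
sudo systemctl edit ollama.service     # add:  [Service]
                                       #       Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama

# Follow the server journal (the log shown in this comment):
journalctl -u ollama -f

# Run a prompt; --verbose prints total duration, prompt eval rate, and
# eval rate (tokens/s) after the response:
ollama run gemma2 --verbose "5+2/18=?"
```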
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 0: general.architecture str = gemma2 Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 1: general.name str = gemma-2-9b-it Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 2: gemma2.context_length u32 = 8192 Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 3: gemma2.embedding_length u32 = 3584 Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 4: gemma2.block_count u32 = 42 Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 5: gemma2.feed_forward_length u32 = 14336 Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 6: gemma2.attention.head_count u32 = 16 Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 7: gemma2.attention.head_count_kv u32 = 8 Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 8: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001 Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 9: gemma2.attention.key_length u32 = 256 Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 10: gemma2.attention.value_length u32 = 256 Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 11: general.file_type u32 = 2 Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 12: gemma2.attn_logit_softcapping f32 = 50.000000 Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 13: gemma2.final_logit_softcapping f32 = 30.000000 Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 14: gemma2.attention.sliding_window u32 = 4096 Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 15: tokenizer.ggml.model str = llama Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 16: tokenizer.ggml.pre str = default Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ... Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 18: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000... Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ... Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 2 Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 1 Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 3 Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0 Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = true Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 26: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol... 
Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = false Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - kv 28: general.quantization_version u32 = 2 Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - type f32: 169 tensors Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - type q4_0: 294 tensors Nov 10 22:58:07 materpc ollama[130240]: llama_model_loader: - type q6_K: 1 tensors Nov 10 22:58:07 materpc ollama[130240]: llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect Nov 10 22:58:07 materpc ollama[130240]: llm_load_vocab: special tokens cache size = 108 Nov 10 22:58:07 materpc ollama[130240]: llm_load_vocab: token to piece cache size = 1.6014 MB Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: format = GGUF V3 (latest) Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: arch = gemma2 Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: vocab type = SPM Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: n_vocab = 256000 Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: n_merges = 0 Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: vocab_only = 1 Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: model type = ?B Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: model ftype = all F32 Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: model params = 9.24 B Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: model size = 5.06 GiB (4.71 BPW) Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: general.name = gemma-2-9b-it Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: BOS token = 2 '<bos>' Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: EOS token = 1 '<eos>' Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: UNK token = 3 '<unk>' Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: PAD token = 0 '<pad>' Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: LF token = 227 '<0x0A>' Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: EOT token = 107 '<end_of_turn>' Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: EOG token = 1 '<eos>' Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: EOG token = 107 '<end_of_turn>' Nov 10 22:58:07 materpc ollama[130240]: llm_load_print_meta: max token length = 93 Nov 10 22:58:07 materpc ollama[130240]: llama_model_load: vocab only - skipping tensors Nov 10 22:58:08 materpc ollama[130240]: [GIN] 2024/11/10 - 22:58:08 | 200 | 1.325050573s | 127.0.0.1 | POST "/api/chat" ``` ``` 0.4.1 DEBUG ENABLED (base) user@materpc:~$ ollama run gemma2:latest --verbose >>> 5+2/18=? Here's how to solve the problem: **Order of Operations** Remember to follow the order of operations (PEMDAS/BODMAS): * **P**arentheses / **B**rackets * **E**xponents / **O**rders * **M**ultiplication and **D**ivision (from left to right) * **A**ddition and **S**ubtraction (from left to right) **Calculation** 1. **Division:** 2/18 = 1/9 2. 
**Addition:** 5 + 1/9 = 46/9 **Answer** 5 + 2/18 = 46/9 (or approximately 5.11) total duration: 2.306254951s load duration: 28.086045ms prompt eval count: 16 token(s) prompt eval duration: 12ms prompt eval rate: 1333.33 tokens/s eval count: 163 token(s) eval duration: 2.265s eval rate: 71.96 tokens/s ----------------- --------- (base) user@materpc:~$ journalctl -u ollama --no-pager -n300 Nov 10 23:08:32 materpc ollama[131600]: dlsym: cuMemGetInfo_v2 - 0x77d44c886e20 Nov 10 23:08:32 materpc ollama[131600]: dlsym: cuCtxDestroy - 0x77d44c8e1850 Nov 10 23:08:32 materpc ollama[131600]: calling cuInit Nov 10 23:08:32 materpc ollama[131600]: calling cuDriverGetVersion Nov 10 23:08:32 materpc ollama[131600]: raw version 0x2f08 Nov 10 23:08:32 materpc ollama[131600]: CUDA driver version: 12.4 Nov 10 23:08:32 materpc ollama[131600]: calling cuDeviceGetCount Nov 10 23:08:32 materpc ollama[131600]: device count 1 Nov 10 23:08:32 materpc ollama[131600]: time=2024-11-10T23:08:32.858Z level=DEBUG source=gpu.go:129 msg="detected GPUs" count=1 library=/usr/lib/x86_64-linux-gnu/libcuda.so.550.120 Nov 10 23:08:32 materpc ollama[131600]: [GPU-a371085e-f395-5afa-80a4-9d858159f7d6] CUDA totalMem 16071 mb Nov 10 23:08:32 materpc ollama[131600]: [GPU-a371085e-f395-5afa-80a4-9d858159f7d6] CUDA freeMem 15857 mb Nov 10 23:08:32 materpc ollama[131600]: [GPU-a371085e-f395-5afa-80a4-9d858159f7d6] Compute Capability 8.9 Nov 10 23:08:33 materpc ollama[131600]: time=2024-11-10T23:08:33.016Z level=DEBUG source=amd_linux.go:416 msg="amdgpu driver not detected /sys/module/amdgpu" Nov 10 23:08:33 materpc ollama[131600]: releasing cuda driver library Nov 10 23:08:33 materpc ollama[131600]: time=2024-11-10T23:08:33.016Z level=INFO source=types.go:123 msg="inference compute" id=GPU-a371085e-f395-5afa-80a4-9d858159f7d6 library=cuda variant=v12 compute=8.9 driver=12.4 name="NVIDIA GeForce RTX 4070 Ti SUPER" total="15.7 GiB" available="15.5 GiB" Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.429Z level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="62.4 GiB" before.free="60.2 GiB" before.free_swap="8.0 GiB" now.total="62.4 GiB" now.free="60.1 GiB" now.free_swap="8.0 GiB" Nov 10 23:08:39 materpc ollama[131600]: initializing /usr/lib/x86_64-linux-gnu/libcuda.so.550.120 Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuInit - 0x77d44c87cbc0 Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDriverGetVersion - 0x77d44c87cbe0 Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDeviceGetCount - 0x77d44c87cc20 Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDeviceGet - 0x77d44c87cc00 Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDeviceGetAttribute - 0x77d44c87cd00 Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDeviceGetUuid - 0x77d44c87cc60 Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDeviceGetName - 0x77d44c87cc40 Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuCtxCreate_v3 - 0x77d44c87cee0 Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuMemGetInfo_v2 - 0x77d44c886e20 Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuCtxDestroy - 0x77d44c8e1850 Nov 10 23:08:39 materpc ollama[131600]: calling cuInit Nov 10 23:08:39 materpc ollama[131600]: calling cuDriverGetVersion Nov 10 23:08:39 materpc ollama[131600]: raw version 0x2f08 Nov 10 23:08:39 materpc ollama[131600]: CUDA driver version: 12.4 Nov 10 23:08:39 materpc ollama[131600]: calling cuDeviceGetCount Nov 10 23:08:39 materpc ollama[131600]: device count 1 Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.508Z level=DEBUG 
source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a371085e-f395-5afa-80a4-9d858159f7d6 name="NVIDIA GeForce RTX 4070 Ti SUPER" overhead="0 B" before.total="15.7 GiB" before.free="15.5 GiB" now.total="15.7 GiB" now.free="15.5 GiB" now.used="214.6 MiB" Nov 10 23:08:39 materpc ollama[131600]: releasing cuda driver library Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.508Z level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x811140 gpu_count=1 Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.547Z level=DEBUG source=sched.go:224 msg="loading first model" model=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.547Z level=DEBUG source=memory.go:107 msg=evaluating library=cuda gpu_count=1 available="[15.5 GiB]" Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.548Z level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 gpu=GPU-a371085e-f395-5afa-80a4-9d858159f7d6 parallel=4 available=16627466240 required="8.8 GiB" Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.548Z level=DEBUG source=gpu.go:398 msg="updating system memory data" before.total="62.4 GiB" before.free="60.1 GiB" before.free_swap="8.0 GiB" now.total="62.4 GiB" now.free="60.1 GiB" now.free_swap="8.0 GiB" Nov 10 23:08:39 materpc ollama[131600]: initializing /usr/lib/x86_64-linux-gnu/libcuda.so.550.120 Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuInit - 0x77d44c87cbc0 Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDriverGetVersion - 0x77d44c87cbe0 Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDeviceGetCount - 0x77d44c87cc20 Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDeviceGet - 0x77d44c87cc00 Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDeviceGetAttribute - 0x77d44c87cd00 Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDeviceGetUuid - 0x77d44c87cc60 Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuDeviceGetName - 0x77d44c87cc40 Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuCtxCreate_v3 - 0x77d44c87cee0 Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuMemGetInfo_v2 - 0x77d44c886e20 Nov 10 23:08:39 materpc ollama[131600]: dlsym: cuCtxDestroy - 0x77d44c8e1850 Nov 10 23:08:39 materpc ollama[131600]: calling cuInit Nov 10 23:08:39 materpc ollama[131600]: calling cuDriverGetVersion Nov 10 23:08:39 materpc ollama[131600]: raw version 0x2f08 Nov 10 23:08:39 materpc ollama[131600]: CUDA driver version: 12.4 Nov 10 23:08:39 materpc ollama[131600]: calling cuDeviceGetCount Nov 10 23:08:39 materpc ollama[131600]: device count 1 Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.610Z level=DEBUG source=gpu.go:448 msg="updating cuda memory data" gpu=GPU-a371085e-f395-5afa-80a4-9d858159f7d6 name="NVIDIA GeForce RTX 4070 Ti SUPER" overhead="0 B" before.total="15.7 GiB" before.free="15.5 GiB" now.total="15.7 GiB" now.free="15.5 GiB" now.used="214.6 MiB" Nov 10 23:08:39 materpc ollama[131600]: releasing cuda driver library Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=INFO source=server.go:105 msg="system memory" total="62.4 GiB" free="60.1 GiB" free_swap="8.0 GiB" Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=memory.go:107 msg=evaluating library=cuda 
gpu_count=1 available="[15.5 GiB]" Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=-1 layers.model=43 layers.offload=43 layers.split="" memory.available="[15.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="8.8 GiB" memory.required.partial="8.8 GiB" memory.required.kv="2.6 GiB" memory.required.allocations="[8.8 GiB]" memory.weights.total="7.0 GiB" memory.weights.repeating="6.3 GiB" memory.weights.nonrepeating="717.8 MiB" memory.graph.full="507.0 MiB" memory.graph.partial="1.2 GiB" Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=common.go:168 msg=extracting runner=cpu payload=linux/amd64/cpu/ollama_llama_server.gz Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=common.go:168 msg=extracting runner=cpu_avx payload=linux/amd64/cpu_avx/ollama_llama_server.gz Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=common.go:168 msg=extracting runner=cpu_avx2 payload=linux/amd64/cpu_avx2/ollama_llama_server.gz Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=common.go:168 msg=extracting runner=cuda_v11 payload=linux/amd64/cuda_v11/ollama_llama_server.gz Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=common.go:168 msg=extracting runner=cuda_v12 payload=linux/amd64/cuda_v12/ollama_llama_server.gz Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=common.go:168 msg=extracting runner=rocm payload=linux/amd64/rocm/ollama_llama_server.gz Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/cpu/ollama_llama_server Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/cpu_avx/ollama_llama_server Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/cpu_avx2/ollama_llama_server Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.611Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/cuda_v11/ollama_llama_server Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/cuda_v12/ollama_llama_server Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/rocm/ollama_llama_server Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/cpu/ollama_llama_server Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/cpu_avx/ollama_llama_server Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/cpu_avx2/ollama_llama_server Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=DEBUG source=common.go:294 
msg="availableServers : found" file=/tmp/ollama3132615929/runners/cuda_v11/ollama_llama_server Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/cuda_v12/ollama_llama_server Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=DEBUG source=common.go:294 msg="availableServers : found" file=/tmp/ollama3132615929/runners/rocm/ollama_llama_server Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama3132615929/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 --ctx-size 8192 --batch-size 512 --n-gpu-layers 43 --verbose --threads 16 --flash-attn --parallel 4 --port 35173" Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=DEBUG source=server.go:400 msg=subprocess environment="[PATH=/home/user/.local/bin:/home/user/miniconda3/bin:/home/user/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/user/.local/bin:/home/user/.local/bin LD_LIBRARY_PATH=/usr/local/lib/ollama:/tmp/ollama3132615929/runners/cuda_v12 CUDA_VISIBLE_DEVICES=GPU-a371085e-f395-5afa-80a4-9d858159f7d6]" Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=INFO source=sched.go:449 msg="loaded runners" count=1 Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=INFO source=server.go:562 msg="waiting for llama runner to start responding" Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.612Z level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server error" Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.638Z level=INFO source=runner.go:863 msg="starting go runner" Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.639Z level=INFO source=runner.go:864 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(gcc)" threads=16 Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.639Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:35173" Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: loaded meta data with 29 key-value pairs and 464 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 (version GGUF V3 (latest)) Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 0: general.architecture str = gemma2 Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 1: general.name str = gemma-2-9b-it Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 2: gemma2.context_length u32 = 8192 Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 3: gemma2.embedding_length u32 = 3584 Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 4: gemma2.block_count u32 = 42 Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 5: gemma2.feed_forward_length u32 = 14336 Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 6: gemma2.attention.head_count u32 = 16 Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 7: gemma2.attention.head_count_kv u32 = 8 Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 8: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001 Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 9: gemma2.attention.key_length u32 = 256 Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 10: gemma2.attention.value_length u32 = 256 Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 11: general.file_type u32 = 2 Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 12: gemma2.attn_logit_softcapping f32 = 50.000000 Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 13: gemma2.final_logit_softcapping f32 = 30.000000 Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 14: gemma2.attention.sliding_window u32 = 4096 Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 15: tokenizer.ggml.model str = llama Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 16: tokenizer.ggml.pre str = default Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ... Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 18: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000... Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ... Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 2 Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 1 Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 3 Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0 Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = true Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 26: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol... 
Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = false Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - kv 28: general.quantization_version u32 = 2 Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - type f32: 169 tensors Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - type q4_0: 294 tensors Nov 10 23:08:39 materpc ollama[131600]: llama_model_loader: - type q6_K: 1 tensors Nov 10 23:08:39 materpc ollama[131600]: llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect Nov 10 23:08:39 materpc ollama[131600]: llm_load_vocab: special tokens cache size = 108 Nov 10 23:08:39 materpc ollama[131600]: llm_load_vocab: token to piece cache size = 1.6014 MB Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: format = GGUF V3 (latest) Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: arch = gemma2 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: vocab type = SPM Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_vocab = 256000 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_merges = 0 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: vocab_only = 0 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_ctx_train = 8192 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_embd = 3584 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_layer = 42 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_head = 16 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_head_kv = 8 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_rot = 256 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_swa = 4096 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_embd_head_k = 256 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_embd_head_v = 256 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_gqa = 2 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_embd_k_gqa = 2048 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_embd_v_gqa = 2048 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: f_norm_eps = 0.0e+00 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: f_clamp_kqv = 0.0e+00 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: f_logit_scale = 0.0e+00 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_ff = 14336 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_expert = 0 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_expert_used = 0 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: causal attn = 1 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: pooling type = 0 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: rope type = 2 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: rope scaling = linear Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: freq_base_train = 10000.0 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: freq_scale_train = 1 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: n_ctx_orig_yarn = 8192 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: rope_finetuned = unknown Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: ssm_d_conv = 
0 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: ssm_d_inner = 0 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: ssm_d_state = 0 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: ssm_dt_rank = 0 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: ssm_dt_b_c_rms = 0 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: model type = 9B Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: model ftype = Q4_0 Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: model params = 9.24 B Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: model size = 5.06 GiB (4.71 BPW) Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: general.name = gemma-2-9b-it Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: BOS token = 2 '<bos>' Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: EOS token = 1 '<eos>' Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: UNK token = 3 '<unk>' Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: PAD token = 0 '<pad>' Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: LF token = 227 '<0x0A>' Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: EOT token = 107 '<end_of_turn>' Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: EOG token = 1 '<eos>' Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: EOG token = 107 '<end_of_turn>' Nov 10 23:08:39 materpc ollama[131600]: llm_load_print_meta: max token length = 93 Nov 10 23:08:39 materpc ollama[131600]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no Nov 10 23:08:39 materpc ollama[131600]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no Nov 10 23:08:39 materpc ollama[131600]: ggml_cuda_init: found 1 CUDA devices: Nov 10 23:08:39 materpc ollama[131600]: Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes Nov 10 23:08:39 materpc ollama[131600]: time=2024-11-10T23:08:39.863Z level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server loading model" Nov 10 23:08:39 materpc ollama[131600]: llm_load_tensors: ggml ctx size = 0.41 MiB Nov 10 23:08:40 materpc ollama[131600]: llm_load_tensors: offloading 42 repeating layers to GPU Nov 10 23:08:40 materpc ollama[131600]: llm_load_tensors: offloading non-repeating layers to GPU Nov 10 23:08:40 materpc ollama[131600]: llm_load_tensors: offloaded 43/43 layers to GPU Nov 10 23:08:40 materpc ollama[131600]: llm_load_tensors: CPU buffer size = 717.77 MiB Nov 10 23:08:40 materpc ollama[131600]: llm_load_tensors: CUDA0 buffer size = 5185.21 MiB Nov 10 23:08:40 materpc ollama[131600]: time=2024-11-10T23:08:40.114Z level=DEBUG source=server.go:607 msg="model load progress 0.12" Nov 10 23:08:40 materpc ollama[131600]: time=2024-11-10T23:08:40.365Z level=DEBUG source=server.go:607 msg="model load progress 1.00" Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: n_ctx = 8192 Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: n_batch = 2048 Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: n_ubatch = 512 Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: flash_attn = 1 Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: freq_base = 10000.0 Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: freq_scale = 1 Nov 10 23:08:40 materpc ollama[131600]: llama_kv_cache_init: CUDA0 KV buffer size = 2688.00 MiB Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: KV self size = 
2688.00 MiB, K (f16): 1344.00 MiB, V (f16): 1344.00 MiB Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: CUDA_Host output buffer size = 3.96 MiB Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: CUDA0 compute buffer size = 507.00 MiB Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: CUDA_Host compute buffer size = 39.01 MiB Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: graph nodes = 1398 Nov 10 23:08:40 materpc ollama[131600]: llama_new_context_with_model: graph splits = 2 Nov 10 23:08:40 materpc ollama[131600]: time=2024-11-10T23:08:40.615Z level=INFO source=server.go:601 msg="llama runner started in 1.00 seconds" Nov 10 23:08:40 materpc ollama[131600]: time=2024-11-10T23:08:40.615Z level=DEBUG source=sched.go:462 msg="finished setting up runner" model=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 Nov 10 23:08:40 materpc ollama[131600]: time=2024-11-10T23:08:40.616Z level=DEBUG source=server.go:955 msg="new runner detected, loading model for cgo tokenization" Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: loaded meta data with 29 key-value pairs and 464 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 (version GGUF V3 (latest)) Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 0: general.architecture str = gemma2 Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 1: general.name str = gemma-2-9b-it Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 2: gemma2.context_length u32 = 8192 Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 3: gemma2.embedding_length u32 = 3584 Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 4: gemma2.block_count u32 = 42 Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 5: gemma2.feed_forward_length u32 = 14336 Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 6: gemma2.attention.head_count u32 = 16 Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 7: gemma2.attention.head_count_kv u32 = 8 Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 8: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001 Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 9: gemma2.attention.key_length u32 = 256 Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 10: gemma2.attention.value_length u32 = 256 Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 11: general.file_type u32 = 2 Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 12: gemma2.attn_logit_softcapping f32 = 50.000000 Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 13: gemma2.final_logit_softcapping f32 = 30.000000 Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 14: gemma2.attention.sliding_window u32 = 4096 Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 15: tokenizer.ggml.model str = llama Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 16: tokenizer.ggml.pre str = default Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ... 
Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 18: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000... Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ... Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 2 Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 1 Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 3 Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0 Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = true Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 26: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol... Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = false Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - kv 28: general.quantization_version u32 = 2 Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - type f32: 169 tensors Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - type q4_0: 294 tensors Nov 10 23:08:40 materpc ollama[131600]: llama_model_loader: - type q6_K: 1 tensors Nov 10 23:08:40 materpc ollama[131600]: llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect Nov 10 23:08:40 materpc ollama[131600]: llm_load_vocab: special tokens cache size = 108 Nov 10 23:08:40 materpc ollama[131600]: llm_load_vocab: token to piece cache size = 1.6014 MB Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: format = GGUF V3 (latest) Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: arch = gemma2 Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: vocab type = SPM Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: n_vocab = 256000 Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: n_merges = 0 Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: vocab_only = 1 Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: model type = ?B Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: model ftype = all F32 Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: model params = 9.24 B Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: model size = 5.06 GiB (4.71 BPW) Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: general.name = gemma-2-9b-it Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: BOS token = 2 '<bos>' Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: EOS token = 1 '<eos>' Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: UNK token = 3 '<unk>' Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: PAD token = 0 '<pad>' Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: LF token = 227 '<0x0A>' Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: EOT token = 107 '<end_of_turn>' Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: EOG token = 1 '<eos>' Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: EOG token = 107 '<end_of_turn>' Nov 10 23:08:40 materpc ollama[131600]: llm_load_print_meta: max token length = 93 Nov 10 
23:08:40 materpc ollama[131600]: llama_model_load: vocab only - skipping tensors Nov 10 23:08:40 materpc ollama[131600]: time=2024-11-10T23:08:40.812Z level=DEBUG source=routes.go:1457 msg="chat request" images=0 prompt="<start_of_turn>user\n5+2/18=?<end_of_turn>\n<start_of_turn>model\nHere's how to solve the problem:\n\n**Remember the order of operations (PEMDAS/BODMAS):**\n\n* **P**arentheses / **B**rackets\n* **E**xponents / **O**rders\n* **M**ultiplication and **D**ivision (from left to right)\n* **A**ddition and **S**ubtraction (from left to right)\n\n1. **Division:** 2/18 = 0.1111 (approximately)\n2. **Addition:** 5 + 0.1111 = 5.1111 (approximately)\n\n\nTherefore, 5 + 2/18 ≈ **5.1111**<end_of_turn>\n<start_of_turn>user\n5+2/18=?<end_of_turn>\n<start_of_turn>model\nHere's how to solve it:\n\n**Remember PEMDAS:** Parentheses, Exponents, Multiplication and Division (left to right), Addition and Subtraction (left to right).\n\n1. **Division:** 2 / 18 = 0.1111 (approximately)\n2. **Addition:** 5 + 0.1111 = 5.1111 (approximately)\n\n\nSo, 5 + 2/18 ≈ **5.1111**<end_of_turn>\n<start_of_turn>user\n5+2/18=?<end_of_turn>\n<start_of_turn>model\nHere's how to solve it:\n\n**Remember the order of operations (PEMDAS):**\n\n* **P**arentheses / **B**rackets\n* **E**xponents / **O**rders\n* **M**ultiplication and **D**ivision (from left to right)\n* **A**ddition and **S**ubtraction (from left to right)\n\n1. **Division:** 2 divided by 18 is 0.1111 (approximately)\n2. **Addition:** 5 plus 0.1111 is 5.1111 (approximately)\n\n\nTherefore, 5 + 2/18 ≈ **5.1111**<end_of_turn>\n<start_of_turn>user\n5+2/18=?<end_of_turn>\n<start_of_turn>model\nHere's how to solve it:\n\n**1. Division:** 2 divided by 18 is 0.1111 (approximately)\n\n**2. Addition:** 5 plus 0.1111 is 5.1111 (approximately)\n\n\nTherefore, 5 + 2/18 ≈ **5.1111**<end_of_turn>\n<start_of_turn>user\n5+2/18=?<end_of_turn>\n<start_of_turn>model\n" Nov 10 23:08:40 materpc ollama[131600]: time=2024-11-10T23:08:40.813Z level=DEBUG source=cache.go:99 msg="loading cache slot" id=0 cache=0 prompt=582 used=0 remaining=582 Nov 10 23:08:42 materpc ollama[131600]: [GIN] 2024/11/10 - 23:08:42 | 200 | 5.703167ms | 192.168.1.55 | GET "/api/tags" Nov 10 23:08:43 materpc ollama[131600]: [GIN] 2024/11/10 - 23:08:43 | 200 | 3.817021475s | 127.0.0.1 | POST "/api/chat" Nov 10 23:08:43 materpc ollama[131600]: time=2024-11-10T23:08:43.221Z level=DEBUG source=sched.go:466 msg="context for request finished" Nov 10 23:08:43 materpc ollama[131600]: time=2024-11-10T23:08:43.221Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 duration=5m0s Nov 10 23:08:43 materpc ollama[131600]: time=2024-11-10T23:08:43.221Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 refCount=0 Nov 10 23:08:43 materpc ollama[131600]: [GIN] 2024/11/10 - 23:08:43 | 200 | 16.59µs | 127.0.0.1 | HEAD "/" Nov 10 23:08:43 materpc ollama[131600]: [GIN] 2024/11/10 - 23:08:43 | 200 | 64.469µs | 127.0.0.1 | GET "/api/ps" Nov 10 23:08:58 materpc ollama[131600]: [GIN] 2024/11/10 - 23:08:58 | 200 | 5.268749ms | 192.168.1.55 | GET "/api/tags" Nov 10 23:08:59 materpc ollama[131600]: [GIN] 2024/11/10 - 23:08:59 | 200 | 18.65µs | 127.0.0.1 | HEAD "/" Nov 10 23:08:59 materpc ollama[131600]: [GIN] 2024/11/10 - 
23:08:59 | 200 | 15.703358ms | 127.0.0.1 | POST "/api/show" Nov 10 23:08:59 materpc ollama[131600]: time=2024-11-10T23:08:59.309Z level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 Nov 10 23:08:59 materpc ollama[131600]: [GIN] 2024/11/10 - 23:08:59 | 200 | 20.230051ms | 127.0.0.1 | POST "/api/generate" Nov 10 23:08:59 materpc ollama[131600]: time=2024-11-10T23:08:59.310Z level=DEBUG source=sched.go:407 msg="context for request finished" Nov 10 23:08:59 materpc ollama[131600]: time=2024-11-10T23:08:59.310Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 duration=5m0s Nov 10 23:08:59 materpc ollama[131600]: time=2024-11-10T23:08:59.310Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 refCount=0 Nov 10 23:09:00 materpc ollama[131600]: time=2024-11-10T23:09:00.727Z level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 Nov 10 23:09:00 materpc ollama[131600]: time=2024-11-10T23:09:00.727Z level=DEBUG source=routes.go:1457 msg="chat request" images=0 prompt="<start_of_turn>user\n5+2/18=?<end_of_turn>\n<start_of_turn>model\n" Nov 10 23:09:00 materpc ollama[131600]: time=2024-11-10T23:09:00.727Z level=DEBUG source=cache.go:99 msg="loading cache slot" id=0 cache=737 prompt=16 used=15 remaining=1 Nov 10 23:09:03 materpc ollama[131600]: [GIN] 2024/11/10 - 23:09:03 | 200 | 2.306287181s | 127.0.0.1 | POST "/api/chat" Nov 10 23:09:03 materpc ollama[131600]: time=2024-11-10T23:09:03.005Z level=DEBUG source=sched.go:407 msg="context for request finished" Nov 10 23:09:03 materpc ollama[131600]: time=2024-11-10T23:09:03.005Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 duration=5m0s Nov 10 23:09:03 materpc ollama[131600]: time=2024-11-10T23:09:03.005Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 refCount=0 Nov 10 23:09:14 materpc ollama[131600]: [GIN] 2024/11/10 - 23:09:14 | 200 | 4.720042ms | 192.168.1.55 | GET "/api/tags" ```

@rick-github commented on GitHub (Nov 11, 2024):

Note that the 0.4 series has switched to the new go runner, and there's going to be some tuning along the way. From some testing, it looks like the switch costs about 1-2% in performance, depending on model and hardware.

Interesting bits from the logs:

```
  llama_new_context_with_model: n_ctx      = 8192
- llama_new_context_with_model: n_batch    = 512
+ llama_new_context_with_model: n_batch    = 2048
  llama_new_context_with_model: n_ubatch   = 512
- llama_new_context_with_model: flash_attn = 1
+ llama_new_context_with_model: flash_attn = 0
  llama_new_context_with_model: freq_base  = 10000.0
  llama_new_context_with_model: freq_scale = 1
  llama_kv_cache_init:      CUDA0 KV buffer size =  2688.00 MiB
@@ -122,8 +120,67 @@
  llama_new_context_with_model:  CUDA_Host  output buffer size =     3.96 MiB
  llama_new_context_with_model:      CUDA0 compute buffer size =   507.00 MiB
  llama_new_context_with_model:  CUDA_Host compute buffer size =    39.01 MiB
- llama_new_context_with_model: graph nodes  = 1398
+ llama_new_context_with_model: graph nodes  = 1690
```

The go runner scales `n_batch` by `OLLAMA_NUM_PARALLEL`, which the C++ runner didn't: with `--parallel 4` (as in the logs above), the base batch of 512 becomes 512 × 4 = 2048. Flash attention is enabled for 0.3.14 but not 0.4.1; it's unclear why. The value of `graph nodes` is affected by `OLLAMA_FLASH_ATTENTION` and has an effect on the graph computation in the runner.

Testing shows that flash attention increases tps by about 1%; it would be interesting to see how enabling FA affects your performance with 0.4.1.
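
If you want to try it, here's a minimal sketch for the default Linux systemd install (the server reads `OLLAMA_FLASH_ATTENTION` at startup):

```
# Assumes the standard systemd service set up by the Linux installer.
sudo systemctl edit ollama.service
# in the editor, add:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
sudo systemctl restart ollama
# then check the log for: llama_new_context_with_model: flash_attn = 1
```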

Apart from this, nothing really sticks out as a cause for the 6% performance reduction you are seeing. If you include the full log, which contains information about configuration, GPU detection, and memory allocation, there might be some insight.


@MMaturax commented on GitHub (Nov 11, 2024):

I ran the 0.4.1 test twice, once with `OLLAMA_FLASH_ATTENTION` enabled and once disabled. With `OLLAMA_FLASH_ATTENTION` enabled, I achieved around 71 tokens/second, whereas with it disabled, I reached approximately 80 tokens/second.

In your tests, the difference is almost negligible, but in mine, version 0.3.14 never drops below 85 tokens/second, while version 0.4.x seems unable to exceed 80 tokens/second. Could this difference be due to drivers, CUDA versions, or something similar?

![image](https://github.com/user-attachments/assets/8fe1fdee-80f6-4357-92ea-00da1dcfb882)

![image](https://github.com/user-attachments/assets/f8b3fa0f-a5f1-439e-9bf4-041b8399176a)


@rick-github commented on GitHub (Nov 11, 2024):

Performance of 0.4.1 went down when `OLLAMA_FLASH_ATTENTION=1`? Can you post logs?
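
In case it's useful, a sketch for capturing debug-level logs (same `journalctl` invocation as the session above; `OLLAMA_DEBUG` is read at startup):

```
# Assumes the default systemd service.
sudo systemctl edit ollama.service   # add under [Service]: Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama
# reproduce the slow run, then collect the tail of the log:
journalctl -u ollama --no-pager -n 300 > ollama.log
```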


@MMaturax commented on GitHub (Nov 12, 2024):

In "0.4.1 DEBUG ENABLED," FLASH_ATTENTION was enabled.


@jessegross commented on GitHub (Nov 20, 2024):

For those who are experiencing this issue and are able to build from source, there was a recent performance-impacting bug fix that went into `main`. Please try it out and report results back here if possible.
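
For reference, a rough sketch of building from `main` (following the repo's development docs of this era; exact steps may differ in later releases):

```
git clone https://github.com/ollama/ollama.git
cd ollama
go generate ./...   # builds the vendored llama.cpp runners
go build .
./ollama serve      # run the locally built server
```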


@rick-github commented on GitHub (Nov 20, 2024):

All values are average tps ± 1SD:

| Version | gemma2 | llama3.1 | llama3.2:3b |
|---|---|---|---|
| 0.3.13 | 61.63±0.04 | 78.85±0.13 | 144.98±0.61 |
| 0.3.14 | 62.33±0.28 | 79.40±0.13 | 148.07±0.26 |
| 0.4.0 | 61.30±0.46 | 79.03±0.23 | 141.35±1.21 |
| 0.4.1 | 61.54±0.08 | 79.00±0.20 | 142.24±0.19 |
| 0.4.2 | 61.50±0.38 | 78.89±0.14 | 144.74±1.32 |
| 0.4.3-pre | 62.21±0.29 | 79.54±0.12 | 148.12±0.21 |
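
For anyone wanting to reproduce this kind of table, a minimal sketch (not the harness used above): repeat one prompt and aggregate the `eval rate` line that `--verbose` prints to stderr; the ± here is the population SD over the runs:

```
for i in $(seq 1 5); do
  # the response on stdout is discarded; the --verbose stats arrive on stderr
  ollama run gemma2:latest --verbose "Why is the sky blue?" 2>&1 >/dev/null |
    awk '/^eval rate:/ {print $3}'
done | awk '{s+=$1; ss+=$1*$1; n++}
            END {m=s/n; printf "%.2f ± %.2f tps\n", m, sqrt(ss/n - m*m)}'
```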

@MMaturax commented on GitHub (Nov 22, 2024):

> For those who are experiencing this issue and are able to build from source, there was a recent performance-impacting bug fix that went into `main`. Please try it out and report results back here if possible.

I’ve conducted various tests on version 0.4.3, and its performance now appears to be the same as 0.3.14. The issue seems to be resolved. Great job, well done!


@jessegross commented on GitHub (Nov 22, 2024):

Thanks for testing!

Reference: github-starred/ollama#51302