[GH-ISSUE #9808] How to set the --cache-type-k, --threads, and --prio parameters for the llama-cli command in Ollama? #52929

Closed
opened 2026-04-29 01:25:01 -05:00 by GiteaMirror · 8 comments

Originally created by @ssdy5366228 on GitHub (Mar 17, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9808

### What is the issue?

**I would greatly appreciate any suggestions and answers.**

```
llama-cli --model /data02/AI_TM/models/models_WF/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M.gguf --cache-type-k q4_0 --threads 48 --n-gpu-layers 12 --temp 0.6 --ctx-size 8192 --min-p 0.05 --batch-size 512 --prio 2 --prompt "<|User|>Hello<|Assistant|>"
```
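For readers following the flag-to-Ollama mapping later in the thread, a brief annotation of what each llama.cpp flag controls (annotations are editorial, not part of the original report):

```
# --cache-type-k q4_0   quantization type for the K side of the KV cache
# --threads 48          CPU threads used during generation
# --n-gpu-layers 12     number of layers offloaded to the GPU
# --ctx-size 8192       context window size in tokens
# --min-p 0.05          min-p sampling threshold
# --batch-size 512      logical batch size for prompt processing
# --prio 2              CPU scheduling priority of the process
```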
When I run the above command, I can have a normal conversation with the model.

I wrote some of the parameters into the Modelfile and generated a new model file. The Modelfile content and command are as follows:

```
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM deepseek-r1:671b

FROM /data02/AI_TM/models/models_WF/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M.gguf
# TEMPLATE """{{- if .System }}{{ .System }}{{ end }}
# {{- range $i, $_ := .Messages }}
# {{- $last := eq (len (slice $.Messages $i)) 1}}
# {{- if eq .Role "user" }}<|User|>{{ .Content }}
# {{- else if eq .Role "assistant" }}<|Assistant|>{{ .Content }}{{- if not $last }}<|end▁of▁sentence|>{{- end }}
# {{- end }}
# {{- if and $last (ne .Role "assistant") }}<|Assistant|>{{- end }}
# {{- end }}"""
PARAMETER stop <|begin▁of▁sentence|>
PARAMETER stop <|end▁of▁sentence|>
PARAMETER stop <|User|>
PARAMETER stop <|Assistant|>
LICENSE """MIT License
#PARAMETER cache-type-k q4_0
#PARAMETER 
PARAMETER num_gpu 12
PARAMETER num_ctx 8192
PARAMETER temperature 0.6
PARAMETER min_p 0.05
TEMPLATE "<|User|>{{ .System }} {{ .Prompt }}<|Assistant|>"

Copyright (c) 2023 DeepSeek

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
"""
```
$ ollama create DeepSeek-R1-Q4_K_M -f ./myR1gguf_Modelfile
gathering model components 
copying file sha256:79834e94e6ca156be1a57c6cf8795a0a9afd8eaed8dfca6247340b0e06c9553a 100% 
parsing GGUF 
using existing layer sha256:79834e94e6ca156be1a57c6cf8795a0a9afd8eaed8dfca6247340b0e06c9553a 
creating new layer sha256:6bb63a0e1db51222ec5a52f8754e69476b73c7a0daf7be346039c2d933b0b9bf 
using existing layer sha256:f4d24e9138dd4603380add165d2b0d970bef471fac194b436ebd50e6147c6588 
writing manifest 
success
```

My ollama.service content is as follows:

```
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/bin:/software/proj-6.2.1/build/bin:/software/proj-9.2.1/build/bin:/usr/local/sqlite3/bin:/opt/rh/devtoolset-9/root/usr/bin:/data01/home/weifeng/.aspera/connect/bin:/software/anaconda3/bin:/data01/home/weifeng/software/biosoft/miniconda3/condabin:/opt/rh/devtoolset-8/root/usr/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/software/R-4.2.3/bin:/data01/home/weifeng/.local/bin:/data01/home/weifeng/bin"
Environment="OLLAMA_MODELS=/data02/AI_TM/models"
Environment="CUDA_VISIBLE_DEVICES=0,1,2"
Environment="OLLAMA_GPU_OVERHEAD=2147483648"
Environment="OLLAMA_FLASH_ATTENTION=0"
Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=4"
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"

[Install]
WantedBy=multi-user.target
```
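(A reminder, since the unit file above is edited in place: systemd only picks up unit-file changes after a reload and restart, e.g.:)

```
$ sudo systemctl daemon-reload
$ sudo systemctl restart ollama.service
```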
```
$ ollama run DeepSeek-R1-Q4_K_M:latest
>>> 1+1=?
=>2.>@37?@)?/@67>#>+)E=C+
```
1. Why is the response garbled when I run the above command? Did I make a mistake in my parameter settings?
2. How do I set the `--cache-type-k`, `--threads`, and `--prio` parameters of the llama-cli command in Ollama?
3. I have already set parameters like `num_gpu` and `num_ctx` in my Modelfile. When I run the model built from this Modelfile, why do the parameters I see differ from the ones in the Modelfile?

![Image](https://github.com/user-attachments/assets/9efe4495-0c4d-4258-9bbe-21a506426c70)

### Relevant log output


### OS

Linux

### GPU

Nvidia

### CPU

Intel

### Ollama version

0.5.12

GiteaMirror added the bug label 2026-04-29 01:25:01 -05:00
@ag2s20150909 commented on GitHub (Mar 17, 2025):

https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-set-the-quantization-type-for-the-kv-cache

https://github.com/ollama/ollama/pull/7983
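For reference, the linked FAQ sets this as an environment variable on the server; on a systemd install that is typically done through an override (a sketch using the values from this thread):

```
$ sudo systemctl edit ollama.service
# then add under [Service]:
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
# and apply:
$ sudo systemctl daemon-reload && sudo systemctl restart ollama.service
```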

@rick-github commented on GitHub (Mar 17, 2025):

```diff
--- myR1gguf_Modelfile.orig	2025-03-17 10:32:26.752810786 +0100
+++ myR1gguf_Modelfile	2025-03-17 10:33:15.158116284 +0100
@@ -15,7 +15,6 @@
 PARAMETER stop <|end▁of▁sentence|>
 PARAMETER stop <|User|>
 PARAMETER stop <|Assistant|>
-LICENSE """MIT License
 #PARAMETER cache-type-k q4_0
 #PARAMETER 
 PARAMETER num_gpu 12
@@ -24,6 +23,7 @@
 PARAMETER min_p 0.05
 TEMPLATE "<|User|>{{ .System }} {{ .Prompt }}<|Assistant|>"
 
+LICENSE """MIT License
 Copyright (c) 2023 DeepSeek
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
```

`--threads` can be set with `num_thread` in the API call or Modelfile. `--prio` is not supported by the ollama runner. In order for `OLLAMA_KV_CACHE_TYPE` to take effect you also need to set `OLLAMA_FLASH_ATTENTION=1`. However, some models don't support FA at the moment.
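For concreteness, a minimal sketch of the two routes mentioned here (48 mirrors the llama-cli value; the model name is the one built earlier in this thread):

```
# Modelfile route: add before rebuilding with `ollama create`
PARAMETER num_thread 48

# API route: per-request override
curl http://localhost:11434/api/generate -d '{
  "model": "DeepSeek-R1-Q4_K_M",
  "prompt": "Hello",
  "options": { "num_thread": 48 }
}'
```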

@ssdy5366228 commented on GitHub (Mar 18, 2025):

> https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-set-the-quantization-type-for-the-kv-cache
>
> [#7983](https://github.com/ollama/ollama/pull/7983)

Thanks a lot

@ssdy5366228 commented on GitHub (Mar 18, 2025):

> --- myR1gguf_Modelfile.orig 2025-03-17 10:32:26.752810786 +0100
> +++ myR1gguf_Modelfile 2025-03-17 10:33:15.158116284 +0100
> @@ -15,7 +15,6 @@
> PARAMETER stop <|end▁of▁sentence|>
> PARAMETER stop <|User|>
> PARAMETER stop <|Assistant|>
> -LICENSE """MIT License
> #PARAMETER cache-type-k q4_0
> #PARAMETER
> PARAMETER num_gpu 12
> @@ -24,6 +23,7 @@
> PARAMETER min_p 0.05
> TEMPLATE "<|User|>{{ .System }} {{ .Prompt }}<|Assistant|>"
>
> +LICENSE """MIT License
> Copyright (c) 2023 DeepSeek
>
> Permission is hereby granted, free of charge, to any person obtaining a copy
> `--threads` can be set with `num_thread` in the API call or Modelfile. `--prio` is not supported by the ollama runner. In order for `OLLAMA_KV_CACHE_TYPE` to take effect you also need to set `OLLAMA_FLASH_ATTENTION=1`. However, some models don't support FA at the moment.

Thanks a lot! I am gonna give your suggestions a shot right now.


@ssdy5366228 commented on GitHub (Mar 18, 2025):

> --- myR1gguf_Modelfile.orig 2025-03-17 10:32:26.752810786 +0100
> +++ myR1gguf_Modelfile 2025-03-17 10:33:15.158116284 +0100
> @@ -15,7 +15,6 @@
> PARAMETER stop <|end▁of▁sentence|>
> PARAMETER stop <|User|>
> PARAMETER stop <|Assistant|>
> -LICENSE """MIT License
> #PARAMETER cache-type-k q4_0
> #PARAMETER
> PARAMETER num_gpu 12
> @@ -24,6 +23,7 @@
> PARAMETER min_p 0.05
> TEMPLATE "<|User|>{{ .System }} {{ .Prompt }}<|Assistant|>"
>
> +LICENSE """MIT License
> Copyright (c) 2023 DeepSeek
>
> Permission is hereby granted, free of charge, to any person obtaining a copy
> `--threads` can be set with `num_thread` in the API call or Modelfile. `--prio` is not supported by the ollama runner. In order for `OLLAMA_KV_CACHE_TYPE` to take effect you also need to set `OLLAMA_FLASH_ATTENTION=1`. However, some models don't support FA at the moment.

`PARAMETER num_thread 48`
I added the parameters to the Modelfile and rebuilt the model file.
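One way to confirm the parameter was actually baked into the rebuilt model is to dump its stored Modelfile back out (a sketch):

```
$ ollama show DeepSeek-R1-Q4_K_M --modelfile | grep num_thread
PARAMETER num_thread 48
```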

Environment="OLLAMA_MODELS=/data02/AI_TM/models"
Environment="CUDA_VISIBLE_DEVICES=0,1,2"
Environment="OLLAMA_GPU_OVERHEAD=2147483648"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=4"
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
Environment="OLLAMA_NUM_THREADS=48"

and I set `OLLAMA_FLASH_ATTENTION=1` in my ollama.service.
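Whether those settings actually reach the runner shows up in the service journal at load time; a quick sketch for surfacing the relevant lines (the full log is quoted further below):

```
$ journalctl -u ollama.service -f | grep -Ei 'flash|kv cache|threads'
```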

```
$ ollama run DeepSeek-R1-Q4_K_M:latest
>>> 1+1=?
"5.BE&D;CB1&CC.<|▁pad▁|>5<|▁pad▁|>81&/D^C

>>>
```

The response is still garbled.
As the nvitop monitoring in the image below shows, the parameters `--ctx-size`, `--n-gpu-layers`, and `--threads` still differ from the values I set.
<img width="1443" alt="Image" src="https://github.com/user-attachments/assets/c4c4bae7-165e-4f6c-bc28-14d60387abb0" />

**Is there still an error in my parameter settings?**

```
Mar 18 09:42:22 Translational-Medicine ollama[215328]: [GIN] 2025/03/18 - 09:42:22 | 200 |      62.186µs |       127.0.0.1 | HEAD     "/"
Mar 18 09:42:22 Translational-Medicine ollama[215328]: [GIN] 2025/03/18 - 09:42:22 | 200 |   32.520952ms |       127.0.0.1 | POST     "/api/show"
Mar 18 09:42:25 Translational-Medicine ollama[215328]: time=2025-03-18T09:42:25.413+08:00 level=INFO source=server.go:97 msg="system memory" total="1007.6 GiB" free="967.7 GiB" free_swap="30.7 GiB"
Mar 18 09:42:25 Translational-Medicine ollama[215328]: time=2025-03-18T09:42:25.912+08:00 level=INFO source=server.go:130 msg=offload library=cuda layers.requested=-1 layers.model=62 layers.offload=20 layers.split=6,7,7 memory.available="[44.0 GiB 44.0 GiB 44.0 GiB]" memory.gpu_overhead="2.0 GiB" memory.required.full="411.4 GiB" memory.required.partial="119.8 GiB" memory.required.kv="9.5 GiB" memory.required.allocations="[37.9 GiB 41.4 GiB 40.5 GiB]" memory.weights.total="385.0 GiB" memory.weights.repeating="384.3 GiB" memory.weights.nonrepeating="725.0 MiB" memory.graph.full="1019.5 MiB" memory.graph.partial="1019.5 MiB"
Mar 18 09:42:25 Translational-Medicine ollama[215328]: time=2025-03-18T09:42:25.913+08:00 level=WARN source=server.go:175 msg="flash attention enabled but not supported by model"
Mar 18 09:42:25 Translational-Medicine ollama[215328]: time=2025-03-18T09:42:25.913+08:00 level=WARN source=server.go:193 msg="quantized kv cache requested but flash attention disabled" type=q4_0
Mar 18 09:42:25 Translational-Medicine ollama[215328]: time=2025-03-18T09:42:25.913+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /data02/AI_TM/models/blobs/sha256-79834e94e6ca156be1a57c6cf8795a0a9afd8eaed8dfca6247340b0e06c9553a --ctx-size 2048 --batch-size 512 --n-gpu-layers 20 --threads 128 --parallel 1 --tensor-split 6,7,7 --port 35760"
Mar 18 09:42:25 Translational-Medicine ollama[215328]: time=2025-03-18T09:42:25.914+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Mar 18 09:42:25 Translational-Medicine ollama[215328]: time=2025-03-18T09:42:25.914+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
Mar 18 09:42:25 Translational-Medicine ollama[215328]: time=2025-03-18T09:42:25.914+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
Mar 18 09:42:25 Translational-Medicine ollama[215328]: time=2025-03-18T09:42:25.939+08:00 level=INFO source=runner.go:932 msg="starting go runner"
Mar 18 09:42:26 Translational-Medicine ollama[215328]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Mar 18 09:42:26 Translational-Medicine ollama[215328]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Mar 18 09:42:26 Translational-Medicine ollama[215328]: ggml_cuda_init: found 3 CUDA devices:
Mar 18 09:42:26 Translational-Medicine ollama[215328]: Device 0: NVIDIA A40, compute capability 8.6, VMM: yes
Mar 18 09:42:26 Translational-Medicine ollama[215328]: Device 1: NVIDIA A40, compute capability 8.6, VMM: yes
Mar 18 09:42:26 Translational-Medicine ollama[215328]: Device 2: NVIDIA A40, compute capability 8.6, VMM: yes
Mar 18 09:42:26 Translational-Medicine ollama[215328]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Mar 18 09:42:26 Translational-Medicine ollama[215328]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Mar 18 09:42:26 Translational-Medicine ollama[215328]: time=2025-03-18T09:42:26.245+08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | cgo(gcc)" threads=128
Mar 18 09:42:26 Translational-Medicine ollama[215328]: time=2025-03-18T09:42:26.245+08:00 level=INFO source=runner.go:993 msg="Server listening on 127.0.0.1:35760"
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_load_model_from_file: using device CUDA0 (NVIDIA A40) - 45060 MiB free
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_load_model_from_file: using device CUDA1 (NVIDIA A40) - 45046 MiB free
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_load_model_from_file: using device CUDA2 (NVIDIA A40) - 45046 MiB free
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: loaded meta data with 48 key-value pairs and 1025 tensors from /data02/AI_TM/models/blobs/sha256-79834e94e6ca156be1a57c6cf8795a0a9afd8eaed8dfca6247340b0e06c9553a (version GGUF V3 (latest))
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv   1:                               general.type str              = model
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 BF16
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv   3:                       general.quantized_by str              = Unsloth
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv   4:                         general.size_label str              = 256x20B
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv   5:                           general.repo_url str              = https://huggingface.co/unsloth
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv   6:                      deepseek2.block_count u32              = 61
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv   7:                   deepseek2.context_length u32              = 163840
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv   8:                 deepseek2.embedding_length u32              = 7168
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv   9:              deepseek2.feed_forward_length u32              = 18432
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  10:             deepseek2.attention.head_count u32              = 128
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  11:          deepseek2.attention.head_count_kv u32              = 128
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  12:                   deepseek2.rope.freq_base f32              = 10000.000000
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  13: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  14:                deepseek2.expert_used_count u32              = 8
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  15:        deepseek2.leading_dense_block_count u32              = 3
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  16:                       deepseek2.vocab_size u32              = 129280
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  17:            deepseek2.attention.q_lora_rank u32              = 1536
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  18:           deepseek2.attention.kv_lora_rank u32              = 512
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  19:             deepseek2.attention.key_length u32              = 192
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  20:           deepseek2.attention.value_length u32              = 128
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  21:       deepseek2.expert_feed_forward_length u32              = 2048
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  22:                     deepseek2.expert_count u32              = 256
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  23:              deepseek2.expert_shared_count u32              = 1
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  24:             deepseek2.expert_weights_scale f32              = 2.500000
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  25:              deepseek2.expert_weights_norm bool             = true
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  26:               deepseek2.expert_gating_func u32              = 2
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  27:             deepseek2.rope.dimension_count u32              = 64
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  28:                deepseek2.rope.scaling.type str              = yarn
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  29:              deepseek2.rope.scaling.factor f32              = 40.000000
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  30: deepseek2.rope.scaling.original_context_length u32              = 4096
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  31: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  32:                       tokenizer.ggml.model str              = gpt2
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  33:                         tokenizer.ggml.pre str              = deepseek-v3
Mar 18 09:42:26 Translational-Medicine ollama[215328]: time=2025-03-18T09:42:26.419+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
Mar 18 09:42:26 Translational-Medicine ollama[215328]: [132B blob data]
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  35:                  tokenizer.ggml.token_type arr[i32,129280]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  36:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  37:                tokenizer.ggml.bos_token_id u32              = 0
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 1
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  39:            tokenizer.ggml.padding_token_id u32              = 128815
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  40:               tokenizer.ggml.add_bos_token bool             = true
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  41:               tokenizer.ggml.add_eos_token bool             = false
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  42:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  43:               general.quantization_version u32              = 2
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  44:                          general.file_type u32              = 15
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  45:                                   split.no u16              = 0
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  46:                        split.tensors.count i32              = 1025
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  47:                                split.count u16              = 0
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - type  f32:  361 tensors
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - type q4_K:  606 tensors
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llama_model_loader: - type q6_K:   58 tensors
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_vocab: special tokens cache size = 819
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_vocab: token to piece cache size = 0.8223 MB
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: format           = GGUF V3 (latest)
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: arch             = deepseek2
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: vocab type       = BPE
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_vocab          = 129280
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_merges         = 127741
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: vocab_only       = 0
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_ctx_train      = 163840
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_embd           = 7168
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_layer          = 61
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_head           = 128
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_head_kv        = 128
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_rot            = 64
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_swa            = 0
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_embd_head_k    = 192
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_embd_head_v    = 128
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_gqa            = 1
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_embd_k_gqa     = 24576
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_embd_v_gqa     = 16384
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_ff             = 18432
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_expert         = 256
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_expert_used    = 8
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: causal attn      = 1
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: pooling type     = 0
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: rope type        = 0
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: rope scaling     = yarn
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: freq_base_train  = 10000.0
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: freq_scale_train = 0.025
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_ctx_orig_yarn  = 4096
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: rope_finetuned   = unknown
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: ssm_d_conv       = 0
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: ssm_d_inner      = 0
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: ssm_d_state      = 0
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: ssm_dt_rank      = 0
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: ssm_dt_b_c_rms   = 0
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: model type       = 671B
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: model ftype      = Q4_K - Medium
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: model params     = 671.03 B
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: model size       = 376.65 GiB (4.82 BPW)
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: general.name     = DeepSeek R1 BF16
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: BOS token        = 0 '<|begin▁of▁sentence|>'
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: EOS token        = 1 '<|end▁of▁sentence|>'
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: EOT token        = 1 '<|end▁of▁sentence|>'
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: PAD token        = 128815 '<|PAD▁TOKEN|>'
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: LF token         = 131 'Ä'
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: FIM PRE token    = 128801 '<|fim▁begin|>'
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: FIM SUF token    = 128800 '<|fim▁hole|>'
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: FIM MID token    = 128802 '<|fim▁end|>'
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: EOG token        = 1 '<|end▁of▁sentence|>'
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: max token length = 256
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_layer_dense_lead   = 3
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_lora_q             = 1536
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_lora_kv            = 512
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_ff_exp             = 2048
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_expert_shared      = 1
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: expert_weights_scale = 2.5
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: expert_weights_norm  = 1
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: expert_gating_func   = sigmoid
Mar 18 09:42:26 Translational-Medicine ollama[215328]: llm_load_print_meta: rope_yarn_log_mul    = 0.1000
Mar 18 09:42:39 Translational-Medicine ollama[215328]: time=2025-03-18T09:42:39.924+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server not responding"
Mar 18 09:42:40 Translational-Medicine ollama[215328]: time=2025-03-18T09:42:40.183+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
Mar 18 09:42:40 Translational-Medicine ollama[215328]: time=2025-03-18T09:42:40.886+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server not responding"
Mar 18 09:42:41 Translational-Medicine ollama[215328]: time=2025-03-18T09:42:41.588+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
Mar 18 09:42:42 Translational-Medicine ollama[215328]: time=2025-03-18T09:42:42.039+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server not responding"
Mar 18 09:42:42 Translational-Medicine ollama[215328]: time=2025-03-18T09:42:42.291+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
Mar 18 09:42:42 Translational-Medicine ollama[215328]: time=2025-03-18T09:42:42.742+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server not responding"
Mar 18 09:43:03 Translational-Medicine ollama[215328]: time=2025-03-18T09:43:03.468+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
Mar 18 09:43:03 Translational-Medicine ollama[215328]: llm_load_tensors: offloading 20 repeating layers to GPU
Mar 18 09:43:03 Translational-Medicine ollama[215328]: llm_load_tensors: offloaded 20/62 layers to GPU
Mar 18 09:43:03 Translational-Medicine ollama[215328]: llm_load_tensors:        CUDA0 model buffer size = 38929.58 MiB
Mar 18 09:43:03 Translational-Medicine ollama[215328]: llm_load_tensors:        CUDA1 model buffer size = 46036.25 MiB
Mar 18 09:43:03 Translational-Medicine ollama[215328]: llm_load_tensors:        CUDA2 model buffer size = 49746.68 MiB
Mar 18 09:43:03 Translational-Medicine ollama[215328]: llm_load_tensors:   CPU_Mapped model buffer size = 250977.12 MiB
Mar 18 09:43:18 Translational-Medicine ollama[215328]: time=2025-03-18T09:43:18.231+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server not responding"
Mar 18 09:43:22 Translational-Medicine ollama[215328]: time=2025-03-18T09:43:22.238+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
Mar 18 09:43:22 Translational-Medicine ollama[215328]: llama_new_context_with_model: n_seq_max     = 1
Mar 18 09:43:22 Translational-Medicine ollama[215328]: llama_new_context_with_model: n_ctx         = 2048
Mar 18 09:43:22 Translational-Medicine ollama[215328]: llama_new_context_with_model: n_ctx_per_seq = 2048
Mar 18 09:43:22 Translational-Medicine ollama[215328]: llama_new_context_with_model: n_batch       = 512
Mar 18 09:43:22 Translational-Medicine ollama[215328]: llama_new_context_with_model: n_ubatch      = 512
Mar 18 09:43:22 Translational-Medicine ollama[215328]: llama_new_context_with_model: flash_attn    = 0
Mar 18 09:43:22 Translational-Medicine ollama[215328]: llama_new_context_with_model: freq_base     = 10000.0
Mar 18 09:43:22 Translational-Medicine ollama[215328]: llama_new_context_with_model: freq_scale    = 0.025
Mar 18 09:43:22 Translational-Medicine ollama[215328]: llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
Mar 18 09:43:22 Translational-Medicine ollama[215328]: llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
Mar 18 09:43:22 Translational-Medicine ollama[215328]: llama_kv_cache_init:      CUDA0 KV buffer size =   960.00 MiB
Mar 18 09:43:22 Translational-Medicine ollama[215328]: llama_kv_cache_init:      CUDA1 KV buffer size =  1120.00 MiB
Mar 18 09:43:22 Translational-Medicine ollama[215328]: llama_kv_cache_init:      CUDA2 KV buffer size =  1120.00 MiB
Mar 18 09:43:23 Translational-Medicine ollama[215328]: llama_kv_cache_init:        CPU KV buffer size =  6560.00 MiB
Mar 18 09:43:23 Translational-Medicine ollama[215328]: llama_new_context_with_model: KV self size  = 9760.00 MiB, K (f16): 5856.00 MiB, V (f16): 3904.00 MiB
Mar 18 09:43:23 Translational-Medicine ollama[215328]: llama_new_context_with_model:        CPU  output buffer size =     0.52 MiB
Mar 18 09:43:23 Translational-Medicine ollama[215328]: llama_new_context_with_model:      CUDA0 compute buffer size =  5030.00 MiB
Mar 18 09:43:23 Translational-Medicine ollama[215328]: llama_new_context_with_model:      CUDA1 compute buffer size =   670.00 MiB
Mar 18 09:43:23 Translational-Medicine ollama[215328]: llama_new_context_with_model:      CUDA2 compute buffer size =   670.00 MiB
Mar 18 09:43:23 Translational-Medicine ollama[215328]: llama_new_context_with_model:  CUDA_Host compute buffer size =    84.01 MiB
Mar 18 09:43:23 Translational-Medicine ollama[215328]: llama_new_context_with_model: graph nodes  = 5025
Mar 18 09:43:23 Translational-Medicine ollama[215328]: llama_new_context_with_model: graph splits = 770 (with bs=512), 5 (with bs=1)
Mar 18 09:43:23 Translational-Medicine ollama[215328]: time=2025-03-18T09:43:23.494+08:00 level=INFO source=server.go:596 msg="llama runner started in 57.58 seconds"
Mar 18 09:43:23 Translational-Medicine ollama[215328]: [GIN] 2025/03/18 - 09:43:23 | 200 |          1m1s |       127.0.0.1 | POST     "/api/generate"
Mar 18 09:43:37 Translational-Medicine ollama[215328]: [GIN] 2025/03/18 - 09:43:37 | 200 |  5.938096276s |       127.0.0.1 | POST     "/api/chat"
```
llama_new_context_with_model: flash_attn = 0 Mar 18 09:43:22 Translational-Medicine ollama[215328]: llama_new_context_with_model: freq_base = 10000.0 Mar 18 09:43:22 Translational-Medicine ollama[215328]: llama_new_context_with_model: freq_scale = 0.025 Mar 18 09:43:22 Translational-Medicine ollama[215328]: llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (163840) -- the full capacity of the model will not be utilized Mar 18 09:43:22 Translational-Medicine ollama[215328]: llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0 Mar 18 09:43:22 Translational-Medicine ollama[215328]: llama_kv_cache_init: CUDA0 KV buffer size = 960.00 MiB Mar 18 09:43:22 Translational-Medicine ollama[215328]: llama_kv_cache_init: CUDA1 KV buffer size = 1120.00 MiB Mar 18 09:43:22 Translational-Medicine ollama[215328]: llama_kv_cache_init: CUDA2 KV buffer size = 1120.00 MiB Mar 18 09:43:23 Translational-Medicine ollama[215328]: llama_kv_cache_init: CPU KV buffer size = 6560.00 MiB Mar 18 09:43:23 Translational-Medicine ollama[215328]: llama_new_context_with_model: KV self size = 9760.00 MiB, K (f16): 5856.00 MiB, V (f16): 3904.00 MiB Mar 18 09:43:23 Translational-Medicine ollama[215328]: llama_new_context_with_model: CPU output buffer size = 0.52 MiB Mar 18 09:43:23 Translational-Medicine ollama[215328]: llama_new_context_with_model: CUDA0 compute buffer size = 5030.00 MiB Mar 18 09:43:23 Translational-Medicine ollama[215328]: llama_new_context_with_model: CUDA1 compute buffer size = 670.00 MiB Mar 18 09:43:23 Translational-Medicine ollama[215328]: llama_new_context_with_model: CUDA2 compute buffer size = 670.00 MiB Mar 18 09:43:23 Translational-Medicine ollama[215328]: llama_new_context_with_model: CUDA_Host compute buffer size = 84.01 MiB Mar 18 09:43:23 Translational-Medicine ollama[215328]: llama_new_context_with_model: graph nodes = 5025 Mar 18 09:43:23 Translational-Medicine ollama[215328]: llama_new_context_with_model: graph splits = 770 (with bs=512), 5 (with bs=1) Mar 18 09:43:23 Translational-Medicine ollama[215328]: time=2025-03-18T09:43:23.494+08:00 level=INFO source=server.go:596 msg="llama runner started in 57.58 seconds" Mar 18 09:43:23 Translational-Medicine ollama[215328]: [GIN] 2025/03/18 - 09:43:23 | 200 | 1m1s | 127.0.0.1 | POST "/api/generate" Mar 18 09:43:37 Translational-Medicine ollama[215328]: [GIN] 2025/03/18 - 09:43:37 | 200 | 5.938096276s | 127.0.0.1 | POST "/api/chat" ```

@rick-github commented on GitHub (Mar 18, 2025):

Did you fix the `LICENSE` definition?

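For context: in the Modelfile above, `LICENSE """MIT License` opens a triple-quoted block, so everything up to the closing `"""`, including the `PARAMETER num_gpu 12` through `TEMPLATE` lines, is parsed as license text rather than as directives. Per the diff quoted later in this thread, the fix is to move the whole `LICENSE` block after the directives. A sketch of the corrected ordering (license text elided):

```
FROM /data02/AI_TM/models/models_WF/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M.gguf
PARAMETER stop <|begin▁of▁sentence|>
PARAMETER stop <|end▁of▁sentence|>
PARAMETER stop <|User|>
PARAMETER stop <|Assistant|>
PARAMETER num_gpu 12
PARAMETER num_ctx 8192
PARAMETER temperature 0.6
PARAMETER min_p 0.05
TEMPLATE "<|User|>{{ .System }} {{ .Prompt }}<|Assistant|>"
LICENSE """MIT License
Copyright (c) 2023 DeepSeek
...
"""
```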

@ssdy5366228 commented on GitHub (Mar 18, 2025):

> Did you fix the `LICENSE` definition?

Thank you so much for your help! I changed the position of the commented code block and successfully ran the model with the preset parameters.

But the response is still garbled:

```
$ ollama run DeepSeek-R1-Q4_K_M:latest
>>> introduce yourself
:=3B*<|▁pad▁|>85(27:;"4'?<|▁pad▁|>4&>"(%.<|▁pad▁|><|▁pad▁|>#69?<|▁pad▁|>.DC<|▁pad▁|>EC(!--&%%0&:%E<|▁pad▁|>=B&)*7<|▁pad▁|>C+&+&/)BE

>>>
```

I’m trying to figure it out.
The journalctl log is as follows:

```
Mar 18 10:56:22 Translational-Medicine ollama[215328]: [GIN] 2025/03/18 - 10:56:22 | 200 |       47.73µs |       127.0.0.1 | HEAD     "/"
Mar 18 10:56:22 Translational-Medicine ollama[215328]: [GIN] 2025/03/18 - 10:56:22 | 200 |   20.116161ms |       127.0.0.1 | POST     "/api/show"
Mar 18 10:56:24 Translational-Medicine ollama[215328]: time=2025-03-18T10:56:24.937+08:00 level=INFO source=sched.go:731 msg="new model will fit in available VRAM, loading" model=/data02/AI_TM/models/blobs/sha256-79834e94e6ca156be1a57c6cf8795a0a9afd8eaed8dfca6247340b0e06c9553a library=cuda parallel=1 required="80.5 GiB"
Mar 18 10:56:25 Translational-Medicine ollama[215328]: time=2025-03-18T10:56:25.438+08:00 level=INFO source=server.go:97 msg="system memory" total="1007.6 GiB" free="967.5 GiB" free_swap="30.7 GiB"
Mar 18 10:56:25 Translational-Medicine ollama[215328]: time=2025-03-18T10:56:25.937+08:00 level=INFO source=server.go:130 msg=offload library=cuda layers.requested=12 layers.model=62 layers.offload=12 layers.split=4,4,4 memory.available="[44.0 GiB 44.0 GiB 44.0 GiB]" memory.gpu_overhead="2.0 GiB" memory.required.full="451.2 GiB" memory.required.partial="80.5 GiB" memory.required.kv="38.1 GiB" memory.required.allocations="[28.0 GiB 26.2 GiB 26.2 GiB]" memory.weights.total="413.6 GiB" memory.weights.repeating="412.9 GiB" memory.weights.nonrepeating="725.0 MiB" memory.graph.full="3.0 GiB" memory.graph.partial="3.0 GiB"
Mar 18 10:56:25 Translational-Medicine ollama[215328]: time=2025-03-18T10:56:25.937+08:00 level=WARN source=server.go:175 msg="flash attention enabled but not supported by model"
Mar 18 10:56:25 Translational-Medicine ollama[215328]: time=2025-03-18T10:56:25.937+08:00 level=WARN source=server.go:193 msg="quantized kv cache requested but flash attention disabled" type=q4_0
Mar 18 10:56:25 Translational-Medicine ollama[215328]: time=2025-03-18T10:56:25.937+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /data02/AI_TM/models/blobs/sha256-79834e94e6ca156be1a57c6cf8795a0a9afd8eaed8dfca6247340b0e06c9553a --ctx-size 8192 --batch-size 512 --n-gpu-layers 12 --threads 48 --parallel 1 --tensor-split 4,4,4 --port 36025"
Mar 18 10:56:25 Translational-Medicine ollama[215328]: time=2025-03-18T10:56:25.938+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Mar 18 10:56:25 Translational-Medicine ollama[215328]: time=2025-03-18T10:56:25.938+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
Mar 18 10:56:25 Translational-Medicine ollama[215328]: time=2025-03-18T10:56:25.939+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
Mar 18 10:56:25 Translational-Medicine ollama[215328]: time=2025-03-18T10:56:25.968+08:00 level=INFO source=runner.go:932 msg="starting go runner"
Mar 18 10:56:26 Translational-Medicine ollama[215328]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Mar 18 10:56:26 Translational-Medicine ollama[215328]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Mar 18 10:56:26 Translational-Medicine ollama[215328]: ggml_cuda_init: found 3 CUDA devices:
Mar 18 10:56:26 Translational-Medicine ollama[215328]: Device 0: NVIDIA A40, compute capability 8.6, VMM: yes
Mar 18 10:56:26 Translational-Medicine ollama[215328]: Device 1: NVIDIA A40, compute capability 8.6, VMM: yes
Mar 18 10:56:26 Translational-Medicine ollama[215328]: Device 2: NVIDIA A40, compute capability 8.6, VMM: yes
Mar 18 10:56:26 Translational-Medicine ollama[215328]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Mar 18 10:56:26 Translational-Medicine ollama[215328]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Mar 18 10:56:26 Translational-Medicine ollama[215328]: time=2025-03-18T10:56:26.281+08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | cgo(gcc)" threads=48
Mar 18 10:56:26 Translational-Medicine ollama[215328]: time=2025-03-18T10:56:26.281+08:00 level=INFO source=runner.go:993 msg="Server listening on 127.0.0.1:36025"
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_load_model_from_file: using device CUDA0 (NVIDIA A40) - 45106 MiB free
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_load_model_from_file: using device CUDA1 (NVIDIA A40) - 45054 MiB free
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_load_model_from_file: using device CUDA2 (NVIDIA A40) - 45040 MiB free
Mar 18 10:56:26 Translational-Medicine ollama[215328]: time=2025-03-18T10:56:26.442+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: loaded meta data with 48 key-value pairs and 1025 tensors from /data02/AI_TM/models/blobs/sha256-79834e94e6ca156be1a57c6cf8795a0a9afd8eaed8dfca6247340b0e06c9553a (version GGUF V3 (latest))
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv   1:                               general.type str              = model
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 BF16
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv   3:                       general.quantized_by str              = Unsloth
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv   4:                         general.size_label str              = 256x20B
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv   5:                           general.repo_url str              = https://huggingface.co/unsloth
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv   6:                      deepseek2.block_count u32              = 61
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv   7:                   deepseek2.context_length u32              = 163840
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv   8:                 deepseek2.embedding_length u32              = 7168
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv   9:              deepseek2.feed_forward_length u32              = 18432
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  10:             deepseek2.attention.head_count u32              = 128
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  11:          deepseek2.attention.head_count_kv u32              = 128
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  12:                   deepseek2.rope.freq_base f32              = 10000.000000
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  13: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  14:                deepseek2.expert_used_count u32              = 8
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  15:        deepseek2.leading_dense_block_count u32              = 3
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  16:                       deepseek2.vocab_size u32              = 129280
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  17:            deepseek2.attention.q_lora_rank u32              = 1536
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  18:           deepseek2.attention.kv_lora_rank u32              = 512
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  19:             deepseek2.attention.key_length u32              = 192
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  20:           deepseek2.attention.value_length u32              = 128
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  21:       deepseek2.expert_feed_forward_length u32              = 2048
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  22:                     deepseek2.expert_count u32              = 256
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  23:              deepseek2.expert_shared_count u32              = 1
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  24:             deepseek2.expert_weights_scale f32              = 2.500000
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  25:              deepseek2.expert_weights_norm bool             = true
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  26:               deepseek2.expert_gating_func u32              = 2
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  27:             deepseek2.rope.dimension_count u32              = 64
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  28:                deepseek2.rope.scaling.type str              = yarn
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  29:              deepseek2.rope.scaling.factor f32              = 40.000000
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  30: deepseek2.rope.scaling.original_context_length u32              = 4096
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  31: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  32:                       tokenizer.ggml.model str              = gpt2
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  33:                         tokenizer.ggml.pre str              = deepseek-v3
Mar 18 10:56:26 Translational-Medicine ollama[215328]: [132B blob data]
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  35:                  tokenizer.ggml.token_type arr[i32,129280]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  36:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  37:                tokenizer.ggml.bos_token_id u32              = 0
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 1
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  39:            tokenizer.ggml.padding_token_id u32              = 128815
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  40:               tokenizer.ggml.add_bos_token bool             = true
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  41:               tokenizer.ggml.add_eos_token bool             = false
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  42:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  43:               general.quantization_version u32              = 2
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  44:                          general.file_type u32              = 15
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  45:                                   split.no u16              = 0
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  46:                        split.tensors.count i32              = 1025
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - kv  47:                                split.count u16              = 0
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - type  f32:  361 tensors
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - type q4_K:  606 tensors
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llama_model_loader: - type q6_K:   58 tensors
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_vocab: special tokens cache size = 819
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_vocab: token to piece cache size = 0.8223 MB
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: format           = GGUF V3 (latest)
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: arch             = deepseek2
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: vocab type       = BPE
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_vocab          = 129280
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_merges         = 127741
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: vocab_only       = 0
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_ctx_train      = 163840
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_embd           = 7168
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_layer          = 61
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_head           = 128
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_head_kv        = 128
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_rot            = 64
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_swa            = 0
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_embd_head_k    = 192
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_embd_head_v    = 128
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_gqa            = 1
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_embd_k_gqa     = 24576
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_embd_v_gqa     = 16384
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_ff             = 18432
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_expert         = 256
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_expert_used    = 8
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: causal attn      = 1
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: pooling type     = 0
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: rope type        = 0
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: rope scaling     = yarn
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: freq_base_train  = 10000.0
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: freq_scale_train = 0.025
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_ctx_orig_yarn  = 4096
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: rope_finetuned   = unknown
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: ssm_d_conv       = 0
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: ssm_d_inner      = 0
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: ssm_d_state      = 0
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: ssm_dt_rank      = 0
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: ssm_dt_b_c_rms   = 0
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: model type       = 671B
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: model ftype      = Q4_K - Medium
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: model params     = 671.03 B
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: model size       = 376.65 GiB (4.82 BPW)
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: general.name     = DeepSeek R1 BF16
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: BOS token        = 0 '<|begin▁of▁sentence|>'
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: EOS token        = 1 '<|end▁of▁sentence|>'
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: EOT token        = 1 '<|end▁of▁sentence|>'
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: PAD token        = 128815 '<|PAD▁TOKEN|>'
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: LF token         = 131 'Ä'
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: FIM PRE token    = 128801 '<|fim▁begin|>'
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: FIM SUF token    = 128800 '<|fim▁hole|>'
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: FIM MID token    = 128802 '<|fim▁end|>'
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: EOG token        = 1 '<|end▁of▁sentence|>'
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: max token length = 256
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_layer_dense_lead   = 3
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_lora_q             = 1536
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_lora_kv            = 512
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_ff_exp             = 2048
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: n_expert_shared      = 1
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: expert_weights_scale = 2.5
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: expert_weights_norm  = 1
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: expert_gating_func   = sigmoid
Mar 18 10:56:26 Translational-Medicine ollama[215328]: llm_load_print_meta: rope_yarn_log_mul    = 0.1000
Mar 18 10:57:02 Translational-Medicine ollama[215328]: llm_load_tensors: offloading 12 repeating layers to GPU
Mar 18 10:57:02 Translational-Medicine ollama[215328]: llm_load_tensors: offloaded 12/62 layers to GPU
Mar 18 10:57:02 Translational-Medicine ollama[215328]: llm_load_tensors:        CUDA0 model buffer size = 25643.85 MiB
Mar 18 10:57:02 Translational-Medicine ollama[215328]: llm_load_tensors:        CUDA1 model buffer size = 28426.68 MiB
Mar 18 10:57:02 Translational-Medicine ollama[215328]: llm_load_tensors:        CUDA2 model buffer size = 28426.68 MiB
Mar 18 10:57:02 Translational-Medicine ollama[215328]: llm_load_tensors:   CPU_Mapped model buffer size = 303192.43 MiB
Mar 18 10:57:11 Translational-Medicine ollama[215328]: time=2025-03-18T10:57:11.843+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server not responding"
Mar 18 10:57:14 Translational-Medicine ollama[215328]: time=2025-03-18T10:57:14.492+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
Mar 18 10:57:14 Translational-Medicine ollama[215328]: llama_new_context_with_model: n_seq_max     = 1
Mar 18 10:57:14 Translational-Medicine ollama[215328]: llama_new_context_with_model: n_ctx         = 8192
Mar 18 10:57:14 Translational-Medicine ollama[215328]: llama_new_context_with_model: n_ctx_per_seq = 8192
Mar 18 10:57:14 Translational-Medicine ollama[215328]: llama_new_context_with_model: n_batch       = 512
Mar 18 10:57:14 Translational-Medicine ollama[215328]: llama_new_context_with_model: n_ubatch      = 512
Mar 18 10:57:14 Translational-Medicine ollama[215328]: llama_new_context_with_model: flash_attn    = 0
Mar 18 10:57:14 Translational-Medicine ollama[215328]: llama_new_context_with_model: freq_base     = 10000.0
Mar 18 10:57:14 Translational-Medicine ollama[215328]: llama_new_context_with_model: freq_scale    = 0.025
Mar 18 10:57:14 Translational-Medicine ollama[215328]: llama_new_context_with_model: n_ctx_per_seq (8192) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
Mar 18 10:57:14 Translational-Medicine ollama[215328]: llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
Mar 18 10:57:14 Translational-Medicine ollama[215328]: llama_kv_cache_init:      CUDA0 KV buffer size =  2560.00 MiB
Mar 18 10:57:15 Translational-Medicine ollama[215328]: llama_kv_cache_init:      CUDA1 KV buffer size =  2560.00 MiB
Mar 18 10:57:15 Translational-Medicine ollama[215328]: llama_kv_cache_init:      CUDA2 KV buffer size =  2560.00 MiB
Mar 18 10:57:22 Translational-Medicine ollama[215328]: llama_kv_cache_init:        CPU KV buffer size = 31360.00 MiB
Mar 18 10:57:22 Translational-Medicine ollama[215328]: llama_new_context_with_model: KV self size  = 39040.00 MiB, K (f16): 23424.00 MiB, V (f16): 15616.00 MiB
Mar 18 10:57:22 Translational-Medicine ollama[215328]: llama_new_context_with_model:        CPU  output buffer size =     0.52 MiB
Mar 18 10:57:22 Translational-Medicine ollama[215328]: llama_new_context_with_model:      CUDA0 compute buffer size =  5039.50 MiB
Mar 18 10:57:22 Translational-Medicine ollama[215328]: llama_new_context_with_model:      CUDA1 compute buffer size =  2218.00 MiB
Mar 18 10:57:22 Translational-Medicine ollama[215328]: llama_new_context_with_model:      CUDA2 compute buffer size =  2218.00 MiB
Mar 18 10:57:22 Translational-Medicine ollama[215328]: llama_new_context_with_model:  CUDA_Host compute buffer size =    96.01 MiB
Mar 18 10:57:22 Translational-Medicine ollama[215328]: llama_new_context_with_model: graph nodes  = 5025
Mar 18 10:57:22 Translational-Medicine ollama[215328]: llama_new_context_with_model: graph splits = 922 (with bs=512), 5 (with bs=1)
Mar 18 10:57:22 Translational-Medicine ollama[215328]: time=2025-03-18T10:57:22.780+08:00 level=INFO source=server.go:596 msg="llama runner started in 56.84 seconds"
Mar 18 10:57:22 Translational-Medicine ollama[215328]: [GIN] 2025/03/18 - 10:57:22 | 200 |          1m0s |       127.0.0.1 | POST     "/api/generate"
Mar 18 10:59:47 Translational-Medicine ollama[215328]: [GIN] 2025/03/18 - 10:59:47 | 200 | 32.082671331s |       127.0.0.1 | POST     "/api/chat"
```
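The two WARN lines at 10:56:25 are the key ones: the runner reports flash attention as unsupported for this model, so the requested q4_0 KV cache is dropped and `llama_kv_cache_init` falls back to `type_k = 'f16'` (visible further down at 10:57:14). A quick way to pull just those lines out of the service log (a minimal sketch, assuming the systemd unit is named `ollama`):

```
journalctl -u ollama --no-pager | grep -Ei "flash attention|kv cache|type_k"
```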

@ssdy5366228 commented on GitHub (Mar 19, 2025):

> --- myR1gguf_Modelfile.orig 2025-03-17 10:32:26.752810786 +0100
> +++ myR1gguf_Modelfile 2025-03-17 10:33:15.158116284 +0100
> @@ -15,7 +15,6 @@
> PARAMETER stop <|end▁of▁sentence|>
> PARAMETER stop <|User|>
> PARAMETER stop <|Assistant|>
> -LICENSE """MIT License
> #PARAMETER cache-type-k q4_0
> #PARAMETER
> PARAMETER num_gpu 12
> @@ -24,6 +23,7 @@
> PARAMETER min_p 0.05
> TEMPLATE "<|User|>{{ .System }} {{ .Prompt }}<|Assistant|>"
>
> +LICENSE """MIT License
> Copyright (c) 2023 DeepSeek
>
> Permission is hereby granted, free of charge, to any person obtaining a copy
> --threads can be set with `num_thread` in the API call or Modelfile. --prio is not supported by the ollama runner. In order for `OLLAMA_KV_CACHE_TYPE` to take effect you also need to set `OLLAMA_FLASH_ATTENTION=1`. However, some models don't support FA at the moment.

I think I might’ve figured it out. Even though I set `Environment="OLLAMA_FLASH_ATTENTION=1"` and `Environment="OLLAMA_KV_CACHE_TYPE=q4_0"` in ollama.service, the logs still say “flash attention enabled but not supported by model”, and the cache defaults to `type_k = 'f16'`.

What’s weird is that llama-cli can use `type_k = 'q4_0'` just fine, so the model itself doesn’t seem to be the issue. Does that sound right?
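For readers hitting the same questions, rick-github's pointers above map onto configuration along these lines. This is a sketch only; the drop-in path, the port, and the `curl` payload are illustrative assumptions, not taken from the thread:

```
# Threads: either bake the value into the Modelfile (then rebuild with
# `ollama create`) by adding:
#   PARAMETER num_thread 48
# or pass it per request through the API:
curl http://localhost:11434/api/generate -d '{
  "model": "DeepSeek-R1-Q4_K_M",
  "prompt": "Hello",
  "options": { "num_thread": 48 }
}'

# KV-cache quantization: set both variables server-wide, e.g. in a systemd
# drop-in at /etc/systemd/system/ollama.service.d/override.conf containing:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
#   Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
# then reload and restart the service:
sudo systemctl daemon-reload && sudo systemctl restart ollama
```

Even then, as the WARN lines above show, the q4_0 cache only takes effect when the runner reports flash-attention support for the model.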
