[GH-ISSUE #10250] ollama in Gentoo Doesn't Use 1080 Ti GPU, Falls Back to CPU #53238

Closed
opened 2026-04-29 02:25:11 -05:00 by GiteaMirror · 7 comments
Owner

Originally created by @nullagit on GitHub (Apr 12, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10250

What is the issue?

I expect to see ollama listed in nvidia-smi and an uptick in utilization on my GPU, but that never happens. Instead, my CPU utilization spikes while GPU usage stays flat. You'll notice CUDA 11.8 and Driver 535 below because I downgraded from CUDA 12 and Driver 570 on the advice of Grok (supposedly for better Pascal 6.1 support?).
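A quick way to confirm the layers are staying on the CPU (generic commands, not specific to the Gentoo package; OLLAMA_DEBUG=1 only makes the server's GPU discovery logging more verbose):

# ollama ps reports where the loaded model is resident (e.g. "100% CPU" vs "100% GPU")
> ollama ps
# list compute processes on the GPU; a loaded model should appear as an ollama runner here
> nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# restart the server with debug logging to capture the GPU discovery details
> OLLAMA_DEBUG=1 ollama serve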

Relevant log output

# /etc/init.d/ollama status
 * status: started
# nvidia-smi 
Sat Apr 12 12:40:35 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02             Driver Version: 535.230.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off | 00000000:81:00.0  On |                  N/A |
| 17%   56C    P0              62W / 250W |    559MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      7179      G   /usr/bin/X                                  273MiB |
|    0   N/A  N/A      7234      G   /usr/lib64/librewolf/librewolf              159MiB |
|    0   N/A  N/A      7411      G   ...72,262144 --variations-seed-version       97MiB |
|    0   N/A  N/A      7692      G   ...erProcess --variations-seed-version       22MiB |
+---------------------------------------------------------------------------------------+
# 

# ldd /usr/bin/ollama | grep cuda
   libcudart.so.11.0 => /opt/cuda/lib64/libcudart.so.11.0 (0x00007f24f9e00000)
# ldd /usr/lib64/ollama/cuda_v11/libggml-cuda.so | grep cudart
   libcudart.so.11.0 => /opt/cuda/lib64/libcudart.so.11.0 (0x00007f4d73c00000)

# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

> groups ollama
video render ollama
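For completeness, device-node permissions can be ruled out with a quick check (standard NVIDIA node paths; the owning group varies between distributions):

> ls -l /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm
# the ollama user needs read/write access on these, which the video group normally provides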

> ./test_cuda
Found 1 CUDA devices
Device 0: NVIDIA GeForce GTX 1080 Ti, Compute Capability: 6.1

> lsmod | grep nvid
nvidia_uvm           1548288  2
nvidia_drm             81920  6
nvidia_modeset       1503232  9 nvidia_drm
nvidia              62107648  552 nvidia_uvm,nvidia_modeset
drm_kms_helper        266240  1 nvidia_drm
drm                   786432  10 drm_kms_helper,nvidia,nvidia_drm
video                  77824  1 nvidia_modeset
backlight              24576  3 video,drm,nvidia_modeset
i2c_core              135168  6 i2c_designware_platform,i2c_designware_core,drm_kms_helper,nvidia,i2c_piix4,drm

Version (from ebuild):
# cat /usr/local/portage/sci-ml/ollama/ollama-9999.ebuild
# Copyright 2024-2025 Gentoo Authors
# Distributed under the terms of the GNU General Public License v2

EAPI=8

# supports ROCM/HIP >=5.5, but we define 6.1 due to the eclass
ROCM_VERSION=6.1
inherit cuda rocm
inherit cmake
inherit go-module systemd toolchain-funcs

DESCRIPTION="Get up and running with Llama 3, Mistral, Gemma, and other language models."
HOMEPAGE="https://ollama.com"

if [[ ${PV} == *9999* ]]; then
	inherit git-r3
	EGIT_REPO_URI="https://github.com/ollama/ollama.git"
else
	SRC_URI="
		https://github.com/ollama/${PN}/archive/refs/tags/v${PV}.tar.gz -> ${P}.gh.tar.gz
		https://github.com/negril/gentoo-overlay-vendored/raw/refs/heads/blobs/${P}-vendor.tar.xz
	"
	KEYWORDS="~amd64"
fi

Build log

# egrep -i 'cuda|nvid|error|llama_server' /var/log/portage-build.log/sci-ml:ollama-9999:20250412-160521.log | egrep -v 'fPIC|packagefile |imports -|TERM=|modinfo|go: download|golang.org|errors|internal'
 * USE:        abi_x86_64 amd64 amdgpu_targets_gfx1030 amdgpu_targets_gfx1100 amdgpu_targets_gfx906 amdgpu_targets_gfx908 amdgpu_targets_gfx90a amdgpu_targets_gfx942 cpu_flags_x86_avx cpu_flags_x86_avx2 cpu_flags_x86_avx512_bf16 cpu_flags_x86_avx512_vnni cpu_flags_x86_avx512f cpu_flags_x86_avx512vbmi cpu_flags_x86_f16c cpu_flags_x86_fma3 cuda elibc_glibc kernel_linux
cmake -C /var/tmp/portage/sci-ml/ollama-9999/work/ollama-9999_build/gentoo_common_config.cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/usr -DGGML_CCACHE=no -DGGML_BLAS=no -DCMAKE_CUDA_ARCHITECTURES=61 -DCMAKE_HIP_COMPILER=NOTFOUND -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_TOOLCHAIN_FILE=/var/tmp/portage/sci-ml/ollama-9999/work/ollama-9999_build/gentoo_toolchain.cmake /var/tmp/portage/sci-ml/ollama-9999/work/ollama-9999
-- Looking for a CUDA compiler
-- Looking for a CUDA compiler - /opt/cuda/bin/nvcc
-- Looking for a CUDA host compiler - /usr/x86_64-pc-linux-gnu/gcc-bin/11
-- Found CUDAToolkit: /opt/cuda/targets/x86_64-linux/include (found version "11.8.89")
-- CUDA Toolkit found
-- Using CUDA architectures: 61
-- The CUDA compiler identification is NVIDIA 11.8.89 with host compiler GNU 11.5.0
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /opt/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
/usr/bin/gcc-11 -Wl,--no-gc-sections -L/opt/cuda/lib64 -lcudart -x c - -o /dev/null || true
/usr/bin/g++-11 -Wl,--no-gc-sections -L/opt/cuda/lib64 -lcudart -x c - -o /dev/null || true
-- Installing: /var/tmp/portage/sci-ml/ollama-9999/image/usr/lib64/ollama/cuda_v11/libggml-cuda.so
-- Set non-toolchain portion of runtime path of "/var/tmp/portage/sci-ml/ollama-9999/image/usr/lib64/ollama/cuda_v11/libggml-cuda.so" to ""
>>> /usr/lib64/ollama/cuda_v11/
>>> /usr/lib64/ollama/cuda_v11/libggml-cuda.so
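To verify that the installed backend really carries Pascal (sm_61) kernels, the embedded CUDA objects can be listed with cuobjdump from the toolkit (output format differs between toolkit versions, so treat this as a sketch):

# /opt/cuda/bin/cuobjdump --list-elf /usr/lib64/ollama/cuda_v11/libggml-cuda.so | head
# entries named *.sm_61.* indicate that CMAKE_CUDA_ARCHITECTURES=61 took effect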

CPU info

# cat /proc/cpuinfo | more
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 25
model		: 24
model name	: AMD Ryzen Threadripper 7970X 32-Cores
stepping	: 1
microcode	: 0xa108108
cpu MHz		: 1500.000
cache size	: 1024 KB
physical id	: 0
siblings	: 64
core id		: 0
cpu cores	: 32
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 16
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d debug_swap
bugs		: sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso
bogomips	: 7990.25
TLB size	: 3584 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 52 bits physical, 57 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]

> cat /var/log/ollama/ollama.log
2025/04/12 12:40:18 routes.go:1231: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/var/lib/ollama/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-04-12T12:40:18.316-04:00 level=INFO source=images.go:458 msg="total blobs: 16"
time=2025-04-12T12:40:18.316-04:00 level=INFO source=images.go:465 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[...]
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
time=2025-04-12T12:40:18.316-04:00 level=INFO source=routes.go:1298 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2025-04-12T12:40:18.316-04:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-04-12T12:40:18.401-04:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-41aae151-e334-6117-350f-1ab006f81f09 library=cuda variant=v12 compute=6.1 driver=12.2 name="NVIDIA GeForce GTX 1080 Ti" total="10.9 GiB" available="10.2 GiB"
[GIN] 2025/04/12 - 12:40:22 | 200 |      73.935µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/04/12 - 12:40:22 | 200 |   23.017086ms |       127.0.0.1 | POST     "/api/show"
time=2025-04-12T12:40:22.745-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.vision.block_count default=0
time=2025-04-12T12:40:22.745-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.attention.key_length default=128
time=2025-04-12T12:40:22.745-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.attention.value_length default=128
time=2025-04-12T12:40:22.745-04:00 level=INFO source=sched.go:722 msg="new model will fit in available VRAM in single GPU, loading" model=/var/lib/ollama/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa gpu=GPU-41aae151-e334-6117-350f-1ab006f81f09 parallel=4 available=10978787328 required="6.2 GiB"
time=2025-04-12T12:40:22.808-04:00 level=INFO source=server.go:105 msg="system memory" total="251.2 GiB" free="242.0 GiB" free_swap="8.0 GiB"
time=2025-04-12T12:40:22.808-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.vision.block_count default=0
time=2025-04-12T12:40:22.808-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.attention.key_length default=128
time=2025-04-12T12:40:22.808-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.attention.value_length default=128
time=2025-04-12T12:40:22.808-04:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[10.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /var/lib/ollama/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 4.33 GiB (4.64 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.8000 MB
print_info: arch             = llama
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 8.03 B
print_info: general.name     = Meta-Llama-3-8B-Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-04-12T12:40:23.003-04:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/bin/ollama runner --model /var/lib/ollama/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 8192 --batch-size 512 --n-gpu-layers 33 --threads 32 --parallel 4 --port 33883"
time=2025-04-12T12:40:23.003-04:00 level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-04-12T12:40:23.003-04:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-04-12T12:40:23.003-04:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-04-12T12:40:23.014-04:00 level=INFO source=runner.go:853 msg="starting go runner"
time=2025-04-12T12:40:23.039-04:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.AVX512_BF16=1 CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-04-12T12:40:23.040-04:00 level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:33883"
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /var/lib/ollama/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 4.33 GiB (4.64 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.8000 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 8.03 B
print_info: general.name     = Meta-Llama-3-8B-Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
time=2025-04-12T12:40:23.255-04:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
load_tensors:   CPU_Mapped model buffer size =  4437.80 MiB
llama_init_from_model: n_seq_max     = 4
llama_init_from_model: n_ctx         = 8192
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =  1024.00 MiB
llama_init_from_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_init_from_model:        CPU  output buffer size =     2.02 MiB
llama_init_from_model:        CPU compute buffer size =   560.01 MiB
llama_init_from_model: graph nodes  = 1030
llama_init_from_model: graph splits = 1
time=2025-04-12T12:40:23.757-04:00 level=INFO source=server.go:619 msg="llama runner started in 0.75 seconds"
[GIN] 2025/04/12 - 12:40:23 | 200 |  1.136202653s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/04/12 - 12:40:41 | 200 | 13.837908528s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/04/12 - 12:45:36 | 200 |      43.858µs |       127.0.0.1 | GET      "/api/version"
time=2025-04-12T12:45:46.955-04:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.103440919 model=/var/lib/ollama/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2025-04-12T12:45:47.205-04:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.354226848 model=/var/lib/ollama/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2025-04-12T12:45:47.455-04:00 level=WARN source=sched.go:648 msg="gpu VRAM usage didn't recover within timeout" seconds=5.603979566 model=/var/lib/ollama/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
2025/04/12 12:57:06 routes.go:1231: INFO server config env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/var/lib/ollama/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-04-12T12:57:06.234-04:00 level=INFO source=images.go:458 msg="total blobs: 16"
time=2025-04-12T12:57:06.235-04:00 level=INFO source=images.go:465 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func3 (5 handlers)
[...]
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
time=2025-04-12T12:57:06.235-04:00 level=INFO source=routes.go:1298 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2025-04-12T12:57:06.235-04:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-04-12T12:57:06.318-04:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-41aae151-e334-6117-350f-1ab006f81f09 library=cuda variant=v12 compute=6.1 driver=12.2 name="NVIDIA GeForce GTX 1080 Ti" total="10.9 GiB" available="10.4 GiB"
[GIN] 2025/04/12 - 12:57:10 | 200 |      72.973µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/04/12 - 12:57:10 | 200 |   22.671641ms |       127.0.0.1 | POST     "/api/show"
time=2025-04-12T12:57:10.739-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.vision.block_count default=0
time=2025-04-12T12:57:10.739-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.attention.key_length default=128
time=2025-04-12T12:57:10.739-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.attention.value_length default=128
time=2025-04-12T12:57:10.739-04:00 level=INFO source=sched.go:722 msg="new model will fit in available VRAM in single GPU, loading" model=/var/lib/ollama/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa gpu=GPU-41aae151-e334-6117-350f-1ab006f81f09 parallel=4 available=11136401408 required="6.2 GiB"
time=2025-04-12T12:57:10.806-04:00 level=INFO source=server.go:105 msg="system memory" total="251.2 GiB" free="242.1 GiB" free_swap="8.0 GiB"
time=2025-04-12T12:57:10.806-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.vision.block_count default=0
time=2025-04-12T12:57:10.806-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.attention.key_length default=128
time=2025-04-12T12:57:10.806-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.attention.value_length default=128
time=2025-04-12T12:57:10.806-04:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[10.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /var/lib/ollama/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 4.33 GiB (4.64 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.8000 MB
print_info: arch             = llama
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 8.03 B
print_info: general.name     = Meta-Llama-3-8B-Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-04-12T12:57:10.963-04:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/bin/ollama runner --model /var/lib/ollama/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 8192 --batch-size 512 --n-gpu-layers 33 --threads 32 --parallel 4 --port 33305"
time=2025-04-12T12:57:10.963-04:00 level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-04-12T12:57:10.963-04:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-04-12T12:57:10.963-04:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-04-12T12:57:10.972-04:00 level=INFO source=runner.go:853 msg="starting go runner"
time=2025-04-12T12:57:10.997-04:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.AVX512_BF16=1 CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-04-12T12:57:10.998-04:00 level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:33305"
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /var/lib/ollama/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 4.33 GiB (4.64 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.8000 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 8.03 B
print_info: general.name     = Meta-Llama-3-8B-Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
time=2025-04-12T12:57:11.215-04:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
load_tensors:   CPU_Mapped model buffer size =  4437.80 MiB
llama_init_from_model: n_seq_max     = 4
llama_init_from_model: n_ctx         = 8192
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =  1024.00 MiB
llama_init_from_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_init_from_model:        CPU  output buffer size =     2.02 MiB
llama_init_from_model:        CPU compute buffer size =   560.01 MiB
llama_init_from_model: graph nodes  = 1030
llama_init_from_model: graph splits = 1
time=2025-04-12T12:57:11.717-04:00 level=INFO source=server.go:619 msg="llama runner started in 0.75 seconds"
[GIN] 2025/04/12 - 12:57:11 | 200 |  1.083427103s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/04/12 - 12:57:27 | 200 |  4.652757822s |       127.0.0.1 | POST     "/api/chat"

> cat ollama_debug.log
[...]
time=2025-04-12T12:24:13.002-04:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="251.2 GiB" before.free="242.3 GiB" before.free_swap="8.0 GiB" now.total="251.2 GiB" now.free="242.2 GiB" now.free_swap="8.0 GiB"
initializing /usr/lib64/libcuda.so.535.230.02
dlsym: cuInit - 0x7fa940cc2470
dlsym: cuDriverGetVersion - 0x7fa940cc2490
dlsym: cuDeviceGetCount - 0x7fa940cc24d0
dlsym: cuDeviceGet - 0x7fa940cc24b0
dlsym: cuDeviceGetAttribute - 0x7fa940cc25b0
dlsym: cuDeviceGetUuid - 0x7fa940cc2510
dlsym: cuDeviceGetName - 0x7fa940cc24f0
dlsym: cuCtxCreate_v3 - 0x7fa940cca170
dlsym: cuMemGetInfo_v2 - 0x7fa940cd5640
dlsym: cuCtxDestroy - 0x7fa940d24640
calling cuInit
calling cuDriverGetVersion
raw version 0x2ef4
CUDA driver version: 12.2
calling cuDeviceGetCount
device count 1
time=2025-04-12T12:24:13.077-04:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-41aae151-e334-6117-350f-1ab006f81f09 name="NVIDIA GeForce GTX 1080 Ti" overhead="0 B" before.total="10.9 GiB" before.free="10.4 GiB" now.total="10.9 GiB" now.free="10.4 GiB" now.used="506.1 MiB"
releasing cuda driver library
time=2025-04-12T12:24:13.077-04:00 level=DEBUG source=sched.go:183 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2025-04-12T12:24:13.115-04:00 level=DEBUG source=sched.go:226 msg="loading first model" model=/home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2025-04-12T12:24:13.115-04:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[10.4 GiB]"
time=2025-04-12T12:24:13.115-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.vision.block_count default=0
time=2025-04-12T12:24:13.115-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.attention.key_length default=128
time=2025-04-12T12:24:13.115-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.attention.value_length default=128
time=2025-04-12T12:24:13.115-04:00 level=INFO source=sched.go:722 msg="new model will fit in available VRAM in single GPU, loading" model=/home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa gpu=GPU-41aae151-e334-6117-350f-1ab006f81f09 parallel=4 available=11181293568 required="6.2 GiB"
time=2025-04-12T12:24:13.116-04:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="251.2 GiB" before.free="242.2 GiB" before.free_swap="8.0 GiB" now.total="251.2 GiB" now.free="242.2 GiB" now.free_swap="8.0 GiB"
initializing /usr/lib64/libcuda.so.535.230.02
dlsym: cuInit - 0x7fa940cc2470
dlsym: cuDriverGetVersion - 0x7fa940cc2490
dlsym: cuDeviceGetCount - 0x7fa940cc24d0
dlsym: cuDeviceGet - 0x7fa940cc24b0
dlsym: cuDeviceGetAttribute - 0x7fa940cc25b0
dlsym: cuDeviceGetUuid - 0x7fa940cc2510
dlsym: cuDeviceGetName - 0x7fa940cc24f0
dlsym: cuCtxCreate_v3 - 0x7fa940cca170
dlsym: cuMemGetInfo_v2 - 0x7fa940cd5640
dlsym: cuCtxDestroy - 0x7fa940d24640
calling cuInit
calling cuDriverGetVersion
raw version 0x2ef4
CUDA driver version: 12.2
calling cuDeviceGetCount
device count 1
time=2025-04-12T12:24:13.193-04:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-41aae151-e334-6117-350f-1ab006f81f09 name="NVIDIA GeForce GTX 1080 Ti" overhead="0 B" before.total="10.9 GiB" before.free="10.4 GiB" now.total="10.9 GiB" now.free="10.4 GiB" now.used="506.1 MiB"
releasing cuda driver library
time=2025-04-12T12:24:13.193-04:00 level=INFO source=server.go:105 msg="system memory" total="251.2 GiB" free="242.2 GiB" free_swap="8.0 GiB"
time=2025-04-12T12:24:13.193-04:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[10.4 GiB]"
time=2025-04-12T12:24:13.193-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.vision.block_count default=0
time=2025-04-12T12:24:13.193-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.attention.key_length default=128
time=2025-04-12T12:24:13.193-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.attention.value_length default=128
time=2025-04-12T12:24:13.193-04:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[10.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2025-04-12T12:24:13.195-04:00 level=DEBUG source=server.go:262 msg="compatible gpu libraries" compatible=[]
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 4.33 GiB (4.64 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 128255 '<|reserved_special_token_250|>' is not marked as EOG
load: control token: 128254 '<|reserved_special_token_249|>' is not marked as EOG
[...trunc...]
load: control token: 128011 '<|reserved_special_token_6|>' is not marked as EOG
load: control token: 128022 '<|reserved_special_token_17|>' is not marked as EOG
load: control token: 128123 '<|reserved_special_token_118|>' is not marked as EOG
load: special tokens cache size = 256
load: token to piece cache size = 0.8000 MB
print_info: arch             = llama
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 8.03 B
print_info: general.name     = Meta-Llama-3-8B-Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-04-12T12:24:13.351-04:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/bin/ollama runner --model /home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 8192 --batch-size 512 --n-gpu-layers 33 --verbose --threads 32 --parallel 4 --port 41227"
time=2025-04-12T12:24:13.351-04:00 level=DEBUG source=server.go:423 msg=subprocess environment="[CUDA_LAUNCH_BLOCKING=1 CUDA_CACHE_PATH=/home/sysop/.cache/nv LD_LIBRARY_PATH=/opt/cuda/lib64:/usr/lib64:/opt/cuda/lib64:/usr/lib64::/usr/bin PATH=/usr/games:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/sysop/bin:/home/sysop/.local/bin CUDA_VISIBLE_DEVICES=GPU-41aae151-e334-6117-350f-1ab006f81f09]"
time=2025-04-12T12:24:13.351-04:00 level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-04-12T12:24:13.351-04:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-04-12T12:24:13.351-04:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-04-12T12:24:13.362-04:00 level=INFO source=runner.go:853 msg="starting go runner"
time=2025-04-12T12:24:13.363-04:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=/opt/cuda/lib64
time=2025-04-12T12:24:13.363-04:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=/usr/lib64
time=2025-04-12T12:24:13.363-04:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=/opt/cuda/lib64
time=2025-04-12T12:24:13.363-04:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=/usr/lib64
time=2025-04-12T12:24:13.363-04:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=/home/sysop
time=2025-04-12T12:24:13.363-04:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=/usr/bin
time=2025-04-12T12:24:13.388-04:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.AVX512_BF16=1 CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-04-12T12:24:13.389-04:00 level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:41227"
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 4.33 GiB (4.64 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 128255 '<|reserved_special_token_250|>' is not marked as EOG
load: control token: 128254 '<|reserved_special_token_249|>' is not marked as EOG
load: control token: 128253 '<|reserved_special_token_248|>' is not marked as EOG
[...trunc...]
load: control token: 128011 '<|reserved_special_token_6|>' is not marked as EOG
load: control token: 128022 '<|reserved_special_token_17|>' is not marked as EOG
load: control token: 128123 '<|reserved_special_token_118|>' is not marked as EOG
load: special tokens cache size = 256
load: token to piece cache size = 0.8000 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 8.03 B
print_info: general.name     = Meta-Llama-3-8B-Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CPU
load_tensors: layer   1 assigned to device CPU
[...]
load_tensors: layer  30 assigned to device CPU
load_tensors: layer  31 assigned to device CPU
load_tensors: layer  32 assigned to device CPU
time=2025-04-12T12:24:13.603-04:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
load_tensors:   CPU_Mapped model buffer size =  4437.80 MiB
llama_init_from_model: n_seq_max     = 4
llama_init_from_model: n_ctx         = 8192
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
llama_kv_cache_init: layer 0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
[...]
llama_kv_cache_init: layer 29: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 30: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 31: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
time=2025-04-12T12:24:13.854-04:00 level=DEBUG source=server.go:625 msg="model load progress 1.00"
llama_kv_cache_init:        CPU KV buffer size =  1024.00 MiB
llama_init_from_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_init_from_model:        CPU  output buffer size =     2.02 MiB
llama_init_from_model:        CPU compute buffer size =   560.01 MiB
llama_init_from_model: graph nodes  = 1030
llama_init_from_model: graph splits = 1
time=2025-04-12T12:24:14.106-04:00 level=INFO source=server.go:619 msg="llama runner started in 0.75 seconds"
time=2025-04-12T12:24:14.106-04:00 level=DEBUG source=sched.go:464 msg="finished setting up runner" model=/home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
[GIN] 2025/04/12 - 12:24:14 | 200 |  1.128327651s |       127.0.0.1 | POST     "/api/generate"
time=2025-04-12T12:24:14.106-04:00 level=DEBUG source=sched.go:468 msg="context for request finished"
time=2025-04-12T12:24:14.106-04:00 level=DEBUG source=sched.go:341 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa duration=5m0s
time=2025-04-12T12:24:14.106-04:00 level=DEBUG source=sched.go:359 msg="after processing request finished event" modelPath=/home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa refCount=0
time=2025-04-12T12:24:19.992-04:00 level=DEBUG source=sched.go:577 msg="evaluating already loaded" model=/home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2025-04-12T12:24:19.992-04:00 level=DEBUG source=routes.go:1522 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nwrite 1 random sentence.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
time=2025-04-12T12:24:19.993-04:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=16 used=0 remaining=16
[GIN] 2025/04/12 - 12:24:20 | 200 |  973.844479ms |       127.0.0.1 | POST     "/api/chat"
time=2025-04-12T12:24:20.938-04:00 level=DEBUG source=sched.go:409 msg="context for request finished"
time=2025-04-12T12:24:20.938-04:00 level=DEBUG source=sched.go:341 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa duration=5m0s
time=2025-04-12T12:24:20.938-04:00 level=DEBUG source=sched.go:359 msg="after processing request finished event" modelPath=/home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa refCount=0

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.0.0

GiteaMirror added the bug label 2026-04-29 02:25:11 -05:00

@rick-github commented on GitHub (Apr 12, 2025):

You left out the bit of the log that shows model loading. Instead of selecting bits with grep, just post the whole thing.


@nullagit commented on GitHub (Apr 12, 2025):

> You left out the bit of the log that shows model loading. Instead of selecting bits with grep, just post the whole thing.

Updated the OP.


@rick-github commented on GitHub (Apr 12, 2025):

time=2025-04-12T12:24:13.363-04:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=/usr/bin
time=2025-04-12T12:24:13.388-04:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.AVX512_BF16=1 CPU.0.LLAMAFILE=1 compiler=cgo(gcc)

Looks like it didn't find any GPU backends. What's the output of

find /usr/lib/ollama /usr/local/lib/ollama

Also set OLLAMA_DEBUG=1 and re-do the model load, there will be more information about backend loading.
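
For reference, a minimal sketch of one way to capture and skim such a debug log on this setup; the model name and grep patterns below are only illustrative, taken from the log lines that appear later in this thread:

```shell
# Stop the OpenRC service, then run the server in the foreground with debug logging
/etc/init.d/ollama stop
OLLAMA_DEBUG=1 ollama serve 2>&1 | tee ollama_debug.log

# In another terminal, trigger a model load (llama3 is just an example),
# then check which ggml backends and GPUs the server and runner reported
ollama run llama3 "this is a test"
grep -E 'looking for compatible GPUs|inference compute|ggml backend load' ollama_debug.log
```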


@nullagit commented on GitHub (Apr 12, 2025):

> Looks like it didn't find any GPU backends. What's the output of
>
> find /usr/lib/ollama /usr/local/lib/ollama
> sudo find /usr/lib/ollama /usr/local/lib/ollama /usr/lib64/ollama
find: ‘/usr/lib/ollama’: No such file or directory
find: ‘/usr/local/lib/ollama’: No such file or directory
/usr/lib64/ollama
/usr/lib64/ollama/libggml-cpu-skylakex.so
/usr/lib64/ollama/cuda_v12
/usr/lib64/ollama/cuda_v12/libcudart.so.12
/usr/lib64/ollama/cuda_v12/libcudart.so.12.8.90
/usr/lib64/ollama/cuda_v12/libcublas.so.12.8.4.1
/usr/lib64/ollama/cuda_v12/libcublasLt.so.12
/usr/lib64/ollama/cuda_v12/libcublas.so.12
/usr/lib64/ollama/cuda_v12/libcublasLt.so.12.8.4.1
/usr/lib64/ollama/cuda_v12/libggml-cuda.so
/usr/lib64/ollama/libggml-base.so
/usr/lib64/ollama/libggml-cpu-icelake.so
/usr/lib64/ollama/libggml-cpu-haswell.so
/usr/lib64/ollama/libggml-cpu-sandybridge.so

> Also set OLLAMA_DEBUG=1 and re-do the model load, there will be more information about backend loading.

> OLLAMA_DEBUG=1 ollama serve 2>&1 | tee ollama_debug.log
2025/04/12 14:38:16 routes.go:1231: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/sysop/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-04-12T14:38:16.455-04:00 level=INFO source=images.go:458 msg="total blobs: 10"
time=2025-04-12T14:38:16.455-04:00 level=INFO source=images.go:465 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func3 (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func4 (5 handlers)
[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embed                --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
time=2025-04-12T14:38:16.455-04:00 level=INFO source=routes.go:1298 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2025-04-12T14:38:16.455-04:00 level=DEBUG source=sched.go:107 msg="starting llm scheduler"
time=2025-04-12T14:38:16.455-04:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-04-12T14:38:16.459-04:00 level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-04-12T14:38:16.459-04:00 level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=libcuda.so*
time=2025-04-12T14:38:16.459-04:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[/usr/bin/libcuda.so* /opt/cuda/lib64/libcuda.so* /usr/lib64/libcuda.so* /opt/cuda/lib64/libcuda.so* /usr/lib64/libcuda.so* /home/sysop/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2025-04-12T14:38:16.465-04:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[/usr/lib64/libcuda.so.570.133.07]
initializing /usr/lib64/libcuda.so.570.133.07
dlsym: cuInit - 0x7fe34fd0fe70
dlsym: cuDriverGetVersion - 0x7fe34fd0fe90
dlsym: cuDeviceGetCount - 0x7fe34fd0fed0
dlsym: cuDeviceGet - 0x7fe34fd0feb0
dlsym: cuDeviceGetAttribute - 0x7fe34fd0ffb0
dlsym: cuDeviceGetUuid - 0x7fe34fd0ff10
dlsym: cuDeviceGetName - 0x7fe34fd0fef0
dlsym: cuCtxCreate_v3 - 0x7fe34fd10190
dlsym: cuMemGetInfo_v2 - 0x7fe34fd10910
dlsym: cuCtxDestroy - 0x7fe34fd6eab0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 1
time=2025-04-12T14:38:16.474-04:00 level=DEBUG source=gpu.go:125 msg="detected GPUs" count=1 library=/usr/lib64/libcuda.so.570.133.07
[GPU-41aae151-e334-6117-350f-1ab006f81f09] CUDA totalMem 11162 mb
[GPU-41aae151-e334-6117-350f-1ab006f81f09] CUDA freeMem 10486 mb
[GPU-41aae151-e334-6117-350f-1ab006f81f09] Compute Capability 6.1
time=2025-04-12T14:38:16.540-04:00 level=DEBUG source=amd_linux.go:419 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cuda driver library
time=2025-04-12T14:38:16.540-04:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-41aae151-e334-6117-350f-1ab006f81f09 library=cuda variant=v12 compute=6.1 driver=12.8 name="NVIDIA GeForce GTX 1080 Ti" total="10.9 GiB" available="10.2 GiB"
[GIN] 2025/04/12 - 14:38:25 | 200 |      68.315µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/04/12 - 14:38:25 | 200 |   50.904042ms |       127.0.0.1 | POST     "/api/show"
time=2025-04-12T14:38:25.879-04:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="251.2 GiB" before.free="241.7 GiB" before.free_swap="8.0 GiB" now.total="251.2 GiB" now.free="241.8 GiB" now.free_swap="8.0 GiB"
initializing /usr/lib64/libcuda.so.570.133.07
dlsym: cuInit - 0x7fe34fd0fe70
dlsym: cuDriverGetVersion - 0x7fe34fd0fe90
dlsym: cuDeviceGetCount - 0x7fe34fd0fed0
dlsym: cuDeviceGet - 0x7fe34fd0feb0
dlsym: cuDeviceGetAttribute - 0x7fe34fd0ffb0
dlsym: cuDeviceGetUuid - 0x7fe34fd0ff10
dlsym: cuDeviceGetName - 0x7fe34fd0fef0
dlsym: cuCtxCreate_v3 - 0x7fe34fd10190
dlsym: cuMemGetInfo_v2 - 0x7fe34fd10910
dlsym: cuCtxDestroy - 0x7fe34fd6eab0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 1
time=2025-04-12T14:38:25.946-04:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-41aae151-e334-6117-350f-1ab006f81f09 name="NVIDIA GeForce GTX 1080 Ti" overhead="0 B" before.total="10.9 GiB" before.free="10.2 GiB" now.total="10.9 GiB" now.free="10.3 GiB" now.used="666.2 MiB"
releasing cuda driver library
time=2025-04-12T14:38:25.946-04:00 level=DEBUG source=sched.go:183 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2025-04-12T14:38:25.968-04:00 level=DEBUG source=sched.go:226 msg="loading first model" model=/home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2025-04-12T14:38:25.968-04:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[10.3 GiB]"
time=2025-04-12T14:38:25.968-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.vision.block_count default=0
time=2025-04-12T14:38:25.969-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.attention.key_length default=128
time=2025-04-12T14:38:25.969-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.attention.value_length default=128
time=2025-04-12T14:38:25.969-04:00 level=INFO source=sched.go:722 msg="new model will fit in available VRAM in single GPU, loading" model=/home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa gpu=GPU-41aae151-e334-6117-350f-1ab006f81f09 parallel=4 available=11006115840 required="6.2 GiB"
time=2025-04-12T14:38:25.969-04:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="251.2 GiB" before.free="241.8 GiB" before.free_swap="8.0 GiB" now.total="251.2 GiB" now.free="241.7 GiB" now.free_swap="8.0 GiB"
initializing /usr/lib64/libcuda.so.570.133.07
dlsym: cuInit - 0x7fe34fd0fe70
dlsym: cuDriverGetVersion - 0x7fe34fd0fe90
dlsym: cuDeviceGetCount - 0x7fe34fd0fed0
dlsym: cuDeviceGet - 0x7fe34fd0feb0
dlsym: cuDeviceGetAttribute - 0x7fe34fd0ffb0
dlsym: cuDeviceGetUuid - 0x7fe34fd0ff10
dlsym: cuDeviceGetName - 0x7fe34fd0fef0
dlsym: cuCtxCreate_v3 - 0x7fe34fd10190
dlsym: cuMemGetInfo_v2 - 0x7fe34fd10910
dlsym: cuCtxDestroy - 0x7fe34fd6eab0
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 1
time=2025-04-12T14:38:26.029-04:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-41aae151-e334-6117-350f-1ab006f81f09 name="NVIDIA GeForce GTX 1080 Ti" overhead="0 B" before.total="10.9 GiB" before.free="10.3 GiB" now.total="10.9 GiB" now.free="10.3 GiB" now.used="666.2 MiB"
releasing cuda driver library
time=2025-04-12T14:38:26.029-04:00 level=INFO source=server.go:105 msg="system memory" total="251.2 GiB" free="241.7 GiB" free_swap="8.0 GiB"
time=2025-04-12T14:38:26.029-04:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[10.3 GiB]"
time=2025-04-12T14:38:26.029-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.vision.block_count default=0
time=2025-04-12T14:38:26.029-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.attention.key_length default=128
time=2025-04-12T14:38:26.029-04:00 level=WARN source=ggml.go:152 msg="key not found" key=llama.attention.value_length default=128
time=2025-04-12T14:38:26.030-04:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[10.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.2 GiB" memory.required.partial="6.2 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2025-04-12T14:38:26.033-04:00 level=DEBUG source=server.go:262 msg="compatible gpu libraries" compatible=[]
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 4.33 GiB (4.64 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 128255 '<|reserved_special_token_250|>' is not marked as EOG
load: control token: 128254 '<|reserved_special_token_249|>' is not marked as EOG
load: control token: 128253 '<|reserved_special_token_248|>' is not marked as EOG
[...]
load: control token: 128063 '<|reserved_special_token_58|>' is not marked as EOG
load: control token: 128117 '<|reserved_special_token_112|>' is not marked as EOG
load: control token: 128011 '<|reserved_special_token_6|>' is not marked as EOG
load: control token: 128022 '<|reserved_special_token_17|>' is not marked as EOG
load: control token: 128123 '<|reserved_special_token_118|>' is not marked as EOG
load: special tokens cache size = 256
load: token to piece cache size = 0.8000 MB
print_info: arch             = llama
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 8.03 B
print_info: general.name     = Meta-Llama-3-8B-Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-04-12T14:38:26.182-04:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/bin/ollama runner --model /home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 8192 --batch-size 512 --n-gpu-layers 33 --verbose --threads 32 --parallel 4 --port 36725"
time=2025-04-12T14:38:26.182-04:00 level=DEBUG source=server.go:423 msg=subprocess environment="[CUDA_CACHE_PATH=/home/sysop/.cache/nv LD_LIBRARY_PATH=/opt/cuda/lib64:/usr/lib64:/opt/cuda/lib64:/usr/lib64::/usr/bin PATH=/usr/games:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/sysop/bin:/home/sysop/.local/bin CUDA_VISIBLE_DEVICES=GPU-41aae151-e334-6117-350f-1ab006f81f09]"
time=2025-04-12T14:38:26.182-04:00 level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-04-12T14:38:26.182-04:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-04-12T14:38:26.183-04:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-04-12T14:38:26.193-04:00 level=INFO source=runner.go:853 msg="starting go runner"
time=2025-04-12T14:38:26.193-04:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=/opt/cuda/lib64
time=2025-04-12T14:38:26.193-04:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=/usr/lib64
time=2025-04-12T14:38:26.193-04:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=/opt/cuda/lib64
time=2025-04-12T14:38:26.193-04:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=/usr/lib64
time=2025-04-12T14:38:26.193-04:00 level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=/home/sysop
time=2025-04-12T14:38:26.193-04:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=/usr/bin
time=2025-04-12T14:38:26.223-04:00 level=INFO source=ggml.go:109 msg=system CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-04-12T14:38:26.224-04:00 level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:36725"
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 4.33 GiB (4.64 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 128255 '<|reserved_special_token_250|>' is not marked as EOG
load: control token: 128254 '<|reserved_special_token_249|>' is not marked as EOG
load: control token: 128253 '<|reserved_special_token_248|>' is not marked as EOG
[...]
load: control token: 128218 '<|reserved_special_token_213|>' is not marked as EOG
load: control token: 128063 '<|reserved_special_token_58|>' is not marked as EOG
load: control token: 128117 '<|reserved_special_token_112|>' is not marked as EOG
load: control token: 128011 '<|reserved_special_token_6|>' is not marked as EOG
load: control token: 128022 '<|reserved_special_token_17|>' is not marked as EOG
load: control token: 128123 '<|reserved_special_token_118|>' is not marked as EOG
load: special tokens cache size = 256
load: token to piece cache size = 0.8000 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 8.03 B
print_info: general.name     = Meta-Llama-3-8B-Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CPU
load_tensors: layer   1 assigned to device CPU
[...]
load_tensors: layer  30 assigned to device CPU
load_tensors: layer  31 assigned to device CPU
load_tensors: layer  32 assigned to device CPU
time=2025-04-12T14:38:26.434-04:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
load_tensors:   CPU_Mapped model buffer size =  4437.80 MiB
llama_init_from_model: n_seq_max     = 4
llama_init_from_model: n_ctx         = 8192
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
llama_kv_cache_init: layer 0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
[...]
llama_kv_cache_init: layer 29: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 30: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
llama_kv_cache_init: layer 31: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024
time=2025-04-12T14:38:28.693-04:00 level=DEBUG source=server.go:625 msg="model load progress 1.00"
llama_kv_cache_init:        CPU KV buffer size =  1024.00 MiB
llama_init_from_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_init_from_model:        CPU  output buffer size =     2.02 MiB
llama_init_from_model:        CPU compute buffer size =   560.01 MiB
llama_init_from_model: graph nodes  = 1030
llama_init_from_model: graph splits = 1
time=2025-04-12T14:38:28.944-04:00 level=INFO source=server.go:619 msg="llama runner started in 2.76 seconds"
time=2025-04-12T14:38:28.944-04:00 level=DEBUG source=sched.go:464 msg="finished setting up runner" model=/home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
[GIN] 2025/04/12 - 14:38:28 | 200 |  3.084105429s |       127.0.0.1 | POST     "/api/generate"
time=2025-04-12T14:38:28.945-04:00 level=DEBUG source=sched.go:468 msg="context for request finished"
time=2025-04-12T14:38:28.945-04:00 level=DEBUG source=sched.go:341 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa duration=5m0s
time=2025-04-12T14:38:28.945-04:00 level=DEBUG source=sched.go:359 msg="after processing request finished event" modelPath=/home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa refCount=0
time=2025-04-12T14:38:34.227-04:00 level=DEBUG source=sched.go:577 msg="evaluating already loaded" model=/home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2025-04-12T14:38:34.227-04:00 level=DEBUG source=routes.go:1522 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nthis is a test<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
time=2025-04-12T14:38:34.228-04:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=14 used=0 remaining=14
[GIN] 2025/04/12 - 14:38:35 | 200 |   1.01890713s |       127.0.0.1 | POST     "/api/chat"
time=2025-04-12T14:38:35.210-04:00 level=DEBUG source=sched.go:409 msg="context for request finished"
time=2025-04-12T14:38:35.210-04:00 level=DEBUG source=sched.go:341 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa duration=5m0s
time=2025-04-12T14:38:35.210-04:00 level=DEBUG source=sched.go:359 msg="after processing request finished event" modelPath=/home/sysop/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa refCount=0

@nullagit commented on GitHub (Apr 12, 2025):

If it helps, here's the entire ollama ebuild from the Gentoo GURU overlay:

```shell
# cat ollama-9999.ebuild
# Copyright 2024-2025 Gentoo Authors
# Distributed under the terms of the GNU General Public License v2

EAPI=8

# supports ROCM/HIP >=5.5, but we define 6.1 due to the eclass
ROCM_VERSION=6.1
inherit cuda rocm
inherit cmake
inherit go-module systemd toolchain-funcs

DESCRIPTION="Get up and running with Llama 3, Mistral, Gemma, and other language models."
HOMEPAGE="https://ollama.com"

if [[ ${PV} == *9999* ]]; then
	inherit git-r3
	EGIT_REPO_URI="https://github.com/ollama/ollama.git"
else
	SRC_URI="
		https://github.com/ollama/${PN}/archive/refs/tags/v${PV}.tar.gz -> ${P}.gh.tar.gz
		https://github.com/negril/gentoo-overlay-vendored/raw/refs/heads/blobs/${P}-vendor.tar.xz
	"
	KEYWORDS="~amd64"
fi

LICENSE="MIT"
SLOT="0"

X86_CPU_FLAGS=(
	avx
	f16c
	avx2
	fma3
	avx512f
	avx512vbmi
	avx512_vnni
	avx512_bf16
	avx_vnni
	amx_tile
	amx_int8
)
CPU_FLAGS=( "${X86_CPU_FLAGS[@]/#/cpu_flags_x86_}" )
IUSE="${CPU_FLAGS[*]} cuda blas mkl rocm"
# IUSE+=" opencl vulkan"

COMMON_DEPEND="
	cuda? (
		dev-util/nvidia-cuda-toolkit:=
	)
	blas? (
		!mkl? (
			virtual/blas
		)
		mkl? (
			sci-libs/mkl
		)
	)
	rocm? (
		>=sci-libs/hipBLAS-5.5:=[${ROCM_USEDEP}]
	)
"

DEPEND="
	${COMMON_DEPEND}
	>=dev-lang/go-1.23.4
"

RDEPEND="
	${COMMON_DEPEND}
	acct-group/${PN}
	acct-user/${PN}
"

PATCHES=(
	"${FILESDIR}/${PN}-0.6.3-use-GNUInstallDirs.patch"
)

src_unpack() {
	if [[ "${PV}" == *9999* ]]; then
		git-r3_src_unpack
		go-module_live_vendor
	else
		go-module_src_unpack
	fi
}

src_prepare() {
	cmake_src_prepare

	sed \
		-e "/set(GGML_CCACHE/s/ON/OFF/g" \
		-e "/PRE_INCLUDE_REGEXES.*cu/d" \
		-e "/PRE_INCLUDE_REGEXES.*hip/d" \
		-i CMakeLists.txt || die sed

	sed \
		-e "s/-O3/$(get-flag O)/g" \
		-i ml/backend/ggml/ggml/src/ggml-cpu/cpu.go || die sed

	if use amd64; then
		if ! use cpu_flags_x86_avx; then
			sed -e "/ggml_add_cpu_backend_variant(sandybridge/s/^/# /g" -i ml/backend/ggml/ggml/src/CMakeLists.txt || die
			# AVX)
		fi
		if
			! use cpu_flags_x86_avx ||
			! use cpu_flags_x86_f16c ||
			! use cpu_flags_x86_avx2 ||
			! use cpu_flags_x86_fma3; then
			sed -e "/ggml_add_cpu_backend_variant(haswell/s/^/# /g" -i ml/backend/ggml/ggml/src/CMakeLists.txt || die
			# AVX F16C AVX2 FMA)
		fi
		if
			! use cpu_flags_x86_avx ||
			! use cpu_flags_x86_f16c ||
			! use cpu_flags_x86_avx2 ||
			! use cpu_flags_x86_fma3 ||
			! use cpu_flags_x86_avx512f; then
			sed -e "/ggml_add_cpu_backend_variant(skylakex/s/^/# /g" -i ml/backend/ggml/ggml/src/CMakeLists.txt ||  die
			# AVX F16C AVX2 FMA AVX512)
		fi
		if
			! use cpu_flags_x86_avx ||
			! use cpu_flags_x86_f16c ||
			! use cpu_flags_x86_avx2 ||
			! use cpu_flags_x86_fma3 ||
			! use cpu_flags_x86_avx512f ||
			! use cpu_flags_x86_avx512vbmi ||
			! use cpu_flags_x86_avx512_vnni; then
			sed -e "/ggml_add_cpu_backend_variant(icelake/s/^/# /g" -i ml/backend/ggml/ggml/src/CMakeLists.txt || die
			# AVX F16C AVX2 FMA AVX512 AVX512_VBMI AVX512_VNNI)
		fi
		if
			! use cpu_flags_x86_avx ||
			! use cpu_flags_x86_f16c ||
			! use cpu_flags_x86_avx2 ||
			! use cpu_flags_x86_fma3 ||
			! use cpu_flags_x86_avx_vnni; then
			sed -e "/ggml_add_cpu_backend_variant(alderlake/s/^/# /g" -i ml/backend/ggml/ggml/src/CMakeLists.txt || die
			# AVX F16C AVX2 FMA AVX_VNNI)
		fi

		if
			! use cpu_flags_x86_avx ||
			! use cpu_flags_x86_f16c ||
			! use cpu_flags_x86_avx2 ||
			! use cpu_flags_x86_fma3 ||
			! use cpu_flags_x86_avx512f ||
			! use cpu_flags_x86_avx512vbmi ||
			! use cpu_flags_x86_avx512_vnni ||
			! use cpu_flags_x86_avx512_bf16 ||
			! use cpu_flags_x86_amx_tile ||
			! use cpu_flags_x86_amx_int8 ; then
			sed -e "/ggml_add_cpu_backend_variant(sapphirerapids/s/^/# /g" -i ml/backend/ggml/ggml/src/CMakeLists.txt || die
			#AVX F16C AVX2 FMA AVX512 AVX512_VBMI AVX512_VNNI AVX512_BF16 AMX_TILE AMX_INT8)
		fi
		: # ml/backend/ggml/ggml/src/CMakeLists.txt
	fi

	# default
	# return
	if use cuda; then
		cuda_src_prepare
	fi

	if use rocm; then
		# --hip-version gets appended to the compile flags which isn't a known flag.
		# This causes rocm builds to fail because -Wunused-command-line-argument is turned on.
		# Use nuclear option to fix this.
		# Disable -Werror's from go modules.
		find "${S}" -name ".go" -exec sed -i "s/ -Werror / /g" {} + || die
	fi
}

src_configure() {
	local mycmakeargs=(
		-DGGML_CCACHE="no"

		# -DGGML_CPU="yes"
		-DGGML_BLAS="$(usex blas)"
		# -DGGML_CUDA="$(usex cuda)"
		# -DGGML_HIP="$(usex rocm)"

		# -DGGML_METAL="yes" # apple
		# missing from ml/backend/ggml/ggml/src/
		# -DGGML_CANN="yes"
		# -DGGML_MUSA="yes"
		# -DGGML_RPC="yes"
		# -DGGML_SYCL="yes"
		# -DGGML_KOMPUTE="$(usex kompute)"
		# -DGGML_OPENCL="$(usex opencl)"
		# -DGGML_VULKAN="$(usex vulkan)"
	)

	if use blas; then
		if use mkl; then
			mycmakeargs+=(
				-DGGML_BLAS_VENDOR="Intel"
			)
		else
			mycmakeargs+=(
				-DGGML_BLAS_VENDOR="Generic"
			)
		fi
	fi
	if use cuda; then
		local -x CUDAHOSTCXX CUDAHOSTLD
		CUDAHOSTCXX="$(cuda_gccdir)"
		CUDAHOSTLD="$(tc-getCXX)"

		cuda_add_sandbox -w
	else
		mycmakeargs+=(
			-DCMAKE_CUDA_COMPILER="NOTFOUND"
		)
	fi

	if use rocm; then
		mycmakeargs+=(
			-DCMAKE_HIP_PLATFORM="amd"
		)

		local -x HIP_ARCHS HIP_PATH
		HIP_ARCHS="$(get_amdgpu_flags)"
		HIP_PATH="${ESYSROOT}/usr"

		check_amdgpu
	else
		mycmakeargs+=(
			-DCMAKE_HIP_COMPILER="NOTFOUND"
		)
	fi

	cmake_src_configure

	# if ! use cuda && ! use rocm; then
	# 	# to configure and build only CPU variants
	# 	set -- cmake --preset Default "${mycmakeargs[@]}"
	# fi

	# if use cuda; then
	# 	# to configure and build only CUDA
	# 	set -- cmake --preset CUDA "${mycmakeargs[@]}"
	# fi

	# if use rocm; then
	# 	# to configure and build only ROCm
	# 	set -- cmake --preset ROCm "${mycmakeargs[@]}"
	# fi

	# echo "$@" >&2
	# "$@" || die -n "${*} failed"
}

src_compile() {
	ego build

	cmake_src_compile

	# if ! use cuda && ! use rocm; then
	# 	# to configure and build only CPU variants
	# 	set -- cmake --build --preset Default -j16
	# fi

	# if use cuda; then
	# 	# to configure and build only CUDA
	# 	set -- cmake --build --preset CUDA -j16
	# fi

	# if use rocm; then
	# 	# to configure and build only ROCm
	# 	set -- cmake --build --preset ROCm -j16
	# fi

	# echo "$@" >&2
	# "$@" || die -n "${*} failed"
}

src_install() {
	dobin ollama

	cmake_src_install

	newinitd "${FILESDIR}/ollama.init" "${PN}"
	newconfd "${FILESDIR}/ollama.confd" "${PN}"

	systemd_dounit "${FILESDIR}/ollama.service"
}

pkg_preinst() {
	keepdir /var/log/ollama
	fperms 750 /var/log/ollama
	fowners "${PN}:${PN}" /var/log/ollama
}

pkg_postinst() {
	if [[ -z ${REPLACING_VERSIONS} ]] ; then
		einfo "Quick guide:"
		einfo "\tollama serve"
		einfo "\tollama run llama3:70b"
		einfo
		einfo "See available models at https://ollama.com/library"
	fi
}
```

@rick-github commented on GitHub (Apr 12, 2025):

ollama finds backends relative to where the ollama binary is installed, by moving up one directory level and appending `lib/ollama`. Since the executable is `/usr/bin/ollama`, it expects to find the backends in `/usr/lib/ollama`. Since the backends are actually in `/usr/lib64/ollama`, ollama doesn't find them. The quickest workaround would be `sudo ln -s /usr/lib64/ollama /usr/lib`.
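
As an illustration only (a sketch based on the paths reported in this thread, not part of the original comments), the lib-vs-lib64 mismatch and the symlink workaround can be checked from a shell like this:

```shell
# ollama searches for backends in <dir of binary>/../lib/ollama
dirname "$(command -v ollama)"      # /usr/bin  -> ollama looks in /usr/lib/ollama

# On this Gentoo box the ebuild installed the GGML backends under lib64 instead
ls /usr/lib/ollama                  # No such file or directory -> nothing to load, CPU fallback
ls /usr/lib64/ollama                # cuda_v11/ ... (the backends actually live here)

# Workaround from the comment above: make the expected path resolve to the real location
sudo ln -s /usr/lib64/ollama /usr/lib
ls /usr/lib/ollama/cuda_v11/libggml-cuda.so   # now reachable where ollama expects it
```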


@nullagit commented on GitHub (Apr 12, 2025):

> The quickest workaround would be `sudo ln -s /usr/lib64/ollama /usr/lib`.

Amazing, that did it. Thank you.
