Compiled the source code, but the GPU is not used to run the model #6201

Open
opened 2025-11-12 13:25:22 -06:00 by GiteaMirror · 1 comment

Originally created by @BigaGrayWolf on GitHub (Mar 5, 2025).

What is the issue?

I downloaded ollama-brucemacd-ctx-shift-err.zip, followed development.md, and tried to run the model,
but the model is loaded onto the CPU instead of the GPU.
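As a quick way to confirm where a loaded model actually resides, one can query the local server's `/api/ps` endpoint; the sketch below assumes the upstream API's `size` and `size_vram` fields (adjust if your build differs):

```python
def on_gpu(ps_response: dict) -> bool:
    """True if every loaded model is fully resident in VRAM.

    ps_response is the parsed JSON from GET /api/ps; each entry in
    "models" is assumed to carry its total size and VRAM-resident size.
    """
    models = ps_response.get("models", [])
    return bool(models) and all(
        m.get("size_vram", 0) >= m.get("size", 1) for m in models
    )

# Usage against a running server (stdlib only):
#   import json, urllib.request
#   ps = json.load(urllib.request.urlopen("http://127.0.0.1:11434/api/ps"))
#   print("GPU" if on_gpu(ps) else "CPU or partial offload")
```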

Relevant log output

time=2025-03-05T02:26:12.765Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-f65bf98e-50af-336a-642c-350d745a2ba4 name="Tesla V100-PCIE-32GB" overhead="0 B" before.total="31.7 GiB" before.free="31.4 GiB" now.total="31.7 GiB" now.free="31.4 GiB" now.used="308.0 MiB"
time=2025-03-05T02:26:12.897Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-34b1d3af-3d84-4828-e90a-4045d85e12ba name="Tesla V100-PCIE-32GB" overhead="0 B" before.total="31.7 GiB" before.free="27.2 GiB" now.total="31.7 GiB" now.free="27.2 GiB" now.used="4.6 GiB"
time=2025-03-05T02:26:13.016Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-a603b82c-c34f-0526-9e82-0c397eee31c0 name="Tesla V100-PCIE-32GB" overhead="0 B" before.total="31.7 GiB" before.free="31.4 GiB" now.total="31.7 GiB" now.free="31.4 GiB" now.used="308.0 MiB"
time=2025-03-05T02:26:13.134Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-8e4b8be4-36ce-f555-afa6-4482247ce095 name="Tesla V100-PCIE-32GB" overhead="0 B" before.total="31.7 GiB" before.free="31.4 GiB" now.total="31.7 GiB" now.free="31.4 GiB" now.used="308.0 MiB"
releasing cuda driver library
time=2025-03-05T02:26:13.134Z level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="125.6 GiB" before.free="84.4 GiB" before.free_swap="0 B" now.total="125.6 GiB" now.free="84.5 GiB" now.free_swap="0 B"
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.520.61.05
dlsym: cuInit - 0x7f011e927290
dlsym: cuDriverGetVersion - 0x7f011e9272b0
dlsym: cuDeviceGetCount - 0x7f011e9272f0
dlsym: cuDeviceGet - 0x7f011e9272d0
dlsym: cuDeviceGetAttribute - 0x7f011e92e8a0
dlsym: cuDeviceGetUuid - 0x7f011e927330
dlsym: cuDeviceGetName - 0x7f011e927310
dlsym: cuCtxCreate_v3 - 0x7f011e92ea80
dlsym: cuMemGetInfo_v2 - 0x7f011e9392e0
dlsym: cuCtxDestroy - 0x7f011e983a20
calling cuInit
calling cuDriverGetVersion
raw version 0x2b48
CUDA driver version: 11.8
calling cuDeviceGetCount
device count 4
[The GPU discovery cycle above (four per-GPU "updating cuda memory data" lines, then re-initialization of libcuda.so.520.61.05 with identical dlsym/cuInit output) repeats with the same values roughly every 0.5 s; only the distinct scheduler lines are kept here.]
time=2025-03-05T02:26:16.551Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.205156345 model=/root/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41
time=2025-03-05T02:26:16.551Z level=DEBUG source=sched.go:385 msg="sending an unloaded event" modelPath=/root/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41
time=2025-03-05T02:26:16.551Z level=DEBUG source=sched.go:309 msg="ignoring unload event with no pending requests"
time=2025-03-05T02:26:17.041Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.695083481 model=/root/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41
time=2025-03-05T02:26:17.520Z level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=6.174074521 model=/root/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41
[The same discovery cycle runs once more at 02:28:04, again with identical output, immediately before the model load below.]
time=2025-03-05T02:28:04.933Z level=DEBUG source=sched.go:225 msg="loading first model" model=/root/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41
time=2025-03-05T02:28:04.933Z level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[31.4 GiB]"
time=2025-03-05T02:28:04.935Z level=WARN source=ggml.go:136 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-05T02:28:04.935Z level=WARN source=ggml.go:136 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-05T02:28:04.937Z level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41 gpu=GPU-f65bf98e-50af-336a-642c-350d745a2ba4 parallel=4 available=33766965248 required="21.5 GiB"
time=2025-03-05T02:28:04.938Z level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="125.6 GiB" before.free="84.5 GiB" before.free_swap="0 B" now.total="125.6 GiB" now.free="84.4 GiB" now.free_swap="0 B"
[Identical libcuda initialization and per-GPU memory-update cycle omitted; same output as the first cycle above.]
time=2025-03-05T02:28:05.458Z level=INFO source=server.go:97 msg="system memory" total="125.6 GiB" free="84.4 GiB" free_swap="0 B"
time=2025-03-05T02:28:05.458Z level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[31.4 GiB]"
time=2025-03-05T02:28:05.458Z level=WARN source=ggml.go:136 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-05T02:28:05.458Z level=WARN source=ggml.go:136 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-05T02:28:05.458Z level=INFO source=server.go:130 msg=offload library=cuda layers.requested=-1 layers.model=65 layers.offload=65 layers.split="" memory.available="[31.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.5 GiB" memory.required.partial="21.5 GiB" memory.required.kv="2.0 GiB" memory.required.allocations="[21.5 GiB]" memory.weights.total="19.5 GiB" memory.weights.repeating="18.9 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="676.0 MiB" memory.graph.partial="916.1 MiB"
time=2025-03-05T02:28:05.459Z level=DEBUG source=server.go:259 msg="compatible gpu libraries" compatible=[]
time=2025-03-05T02:28:05.460Z level=INFO source=server.go:380 msg="starting llama server" cmd="/data/ollama-brucemacd-ctx-shift-err/ollama runner --model /root/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41 --ctx-size 8192 --batch-size 512 --n-gpu-layers 65 --verbose --threads 32 --parallel 4 --port 42115"
time=2025-03-05T02:28:05.460Z level=DEBUG source=server.go:398 msg=subprocess environment="[CUDA_VERSION=11.8.0 LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/data/ollama-brucemacd-ctx-shift-err/build/lib/ollama PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/go/bin CUDA_VISIBLE_DEVICES=GPU-f65bf98e-50af-336a-642c-350d745a2ba4]"
time=2025-03-05T02:28:05.462Z level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-05T02:28:05.462Z level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
time=2025-03-05T02:28:05.462Z level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
time=2025-03-05T02:28:05.487Z level=INFO source=runner.go:938 msg="starting go runner"
time=2025-03-05T02:28:05.487Z level=DEBUG source=ggml.go:78 msg="skipping path which is not part of ollama" path=/usr/local/nvidia/lib
time=2025-03-05T02:28:05.487Z level=DEBUG source=ggml.go:78 msg="skipping path which is not part of ollama" path=/usr/local/nvidia/lib64
time=2025-03-05T02:28:05.487Z level=DEBUG source=ggml.go:84 msg="ggml backend load all from path" path=/data/ollama-brucemacd-ctx-shift-err/build/lib/ollama
ggml_backend_load_best: /data/ollama-brucemacd-ctx-shift-err/build/lib/ollama/libggml-cpu-sandybridge.so score: 20
ggml_backend_load_best: /data/ollama-brucemacd-ctx-shift-err/build/lib/ollama/libggml-cpu-haswell.so score: 55
ggml_backend_load_best: /data/ollama-brucemacd-ctx-shift-err/build/lib/ollama/libggml-cpu-icelake.so score: 0
ggml_backend_load_best: /data/ollama-brucemacd-ctx-shift-err/build/lib/ollama/libggml-cpu-alderlake.so score: 0
ggml_backend_load_best: /data/ollama-brucemacd-ctx-shift-err/build/lib/ollama/libggml-cpu-skylakex.so score: 183
load_backend: loaded CPU backend from /data/ollama-brucemacd-ctx-shift-err/build/lib/ollama/libggml-cpu-skylakex.so
time=2025-03-05T02:28:05.510Z level=INFO source=runner.go:941 msg=system info="CPU : LLAMAFILE = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | LLAMAFILE = 1 | cgo(gcc)" threads=32
time=2025-03-05T02:28:05.510Z level=INFO source=runner.go:999 msg="Server listening on 127.0.0.1:42115"
llama_model_loader: loaded meta data with 34 key-value pairs and 771 tensors from /root/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 32B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5
llama_model_loader: - kv   5:                         general.size_label str              = 32B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-3...
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2.5 32B
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-32B
llama_model_loader: - kv  12:                               general.tags arr[str,2]       = ["chat", "text-generation"]
llama_model_loader: - kv  13:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  14:                          qwen2.block_count u32              = 64
llama_model_loader: - kv  15:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  16:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv  17:                  qwen2.feed_forward_length u32              = 27648
llama_model_loader: - kv  18:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv  19:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  20:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  22:                          general.file_type u32              = 15
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type q4_K:  385 tensors
llama_model_loader: - type q6_K:   65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.48 GiB (4.85 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 22
time=2025-03-05T02:28:05.714Z level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 5120
print_info: n_layer          = 64
print_info: n_head           = 40
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 5
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 27648
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 32B
print_info: model params     = 32.76 B
print_info: general.name     = Qwen2.5 32B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CPU
load_tensors: layer   1 assigned to device CPU
load_tensors: layer   2 assigned to device CPU
load_tensors: layer   3 assigned to device CPU
load_tensors: layer   4 assigned to device CPU
load_tensors: layer   5 assigned to device CPU
load_tensors: layer   6 assigned to device CPU
load_tensors: layer   7 assigned to device CPU
load_tensors: layer   8 assigned to device CPU
load_tensors: layer   9 assigned to device CPU
load_tensors: layer  10 assigned to device CPU
load_tensors: layer  11 assigned to device CPU
load_tensors: layer  12 assigned to device CPU
load_tensors: layer  13 assigned to device CPU
load_tensors: layer  14 assigned to device CPU
load_tensors: layer  15 assigned to device CPU
load_tensors: layer  16 assigned to device CPU
load_tensors: layer  17 assigned to device CPU
load_tensors: layer  18 assigned to device CPU
load_tensors: layer  19 assigned to device CPU
load_tensors: layer  20 assigned to device CPU
load_tensors: layer  21 assigned to device CPU
load_tensors: layer  22 assigned to device CPU
load_tensors: layer  23 assigned to device CPU
load_tensors: layer  24 assigned to device CPU
load_tensors: layer  25 assigned to device CPU
load_tensors: layer  26 assigned to device CPU
load_tensors: layer  27 assigned to device CPU
load_tensors: layer  28 assigned to device CPU
load_tensors: layer  29 assigned to device CPU
load_tensors: layer  30 assigned to device CPU
load_tensors: layer  31 assigned to device CPU
load_tensors: layer  32 assigned to device CPU
load_tensors: layer  33 assigned to device CPU
load_tensors: layer  34 assigned to device CPU
load_tensors: layer  35 assigned to device CPU
load_tensors: layer  36 assigned to device CPU
load_tensors: layer  37 assigned to device CPU
load_tensors: layer  38 assigned to device CPU
load_tensors: layer  39 assigned to device CPU
load_tensors: layer  40 assigned to device CPU
load_tensors: layer  41 assigned to device CPU
load_tensors: layer  42 assigned to device CPU
load_tensors: layer  43 assigned to device CPU
load_tensors: layer  44 assigned to device CPU
load_tensors: layer  45 assigned to device CPU
load_tensors: layer  46 assigned to device CPU
load_tensors: layer  47 assigned to device CPU
load_tensors: layer  48 assigned to device CPU
load_tensors: layer  49 assigned to device CPU
load_tensors: layer  50 assigned to device CPU
load_tensors: layer  51 assigned to device CPU
load_tensors: layer  52 assigned to device CPU
load_tensors: layer  53 assigned to device CPU
load_tensors: layer  54 assigned to device CPU
load_tensors: layer  55 assigned to device CPU
load_tensors: layer  56 assigned to device CPU
load_tensors: layer  57 assigned to device CPU
load_tensors: layer  58 assigned to device CPU
load_tensors: layer  59 assigned to device CPU
load_tensors: layer  60 assigned to device CPU
load_tensors: layer  61 assigned to device CPU
load_tensors: layer  62 assigned to device CPU
load_tensors: layer  63 assigned to device CPU
load_tensors: layer  64 assigned to device CPU
load_tensors:   CPU_Mapped model buffer size = 18926.01 MiB

OS

Ubuntu 22.04

GPU

root@b2b0c81a8393:/data/ollama-brucemacd-ctx-shift-err# nvidia-smi
Wed Mar  5 03:40:07 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:06:00.0 Off |                    0 |
| N/A   36C    P0    25W / 250W |      4MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:2F:00.0 Off |                    0 |
| N/A   40C    P0    36W / 250W |   4380MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   36C    P0    25W / 250W |      4MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   33C    P0    27W / 250W |      4MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

CPU

Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz

Ollama version

I used Docker to build the environment, based on this image:
spacewalkerjp/nvidia_cuda_11.8.0-cudnn8-runtime-ubuntu22.04_updated4automatic1111:latest
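
Since the model ends up on the CPU even though all four GPUs are detected, one thing worth checking is whether the local build actually produced the CUDA backend libraries. A minimal sketch, assuming the build layout implied by the subprocess LD_LIBRARY_PATH in the log (the path and library names here are assumptions, not verified):

```shell
# Assumed layout from the log's LD_LIBRARY_PATH; adjust BUILD_LIB to your checkout.
# A build done without the CUDA toolkit available typically yields no CUDA
# libraries here, in which case the runner falls back to CPU.
BUILD_LIB=/data/ollama-brucemacd-ctx-shift-err/build/lib/ollama
if ls "$BUILD_LIB" 2>/dev/null | grep -qi cuda; then
    echo "CUDA backend present"
else
    echo "no CUDA backend found - rebuild with the CUDA toolkit available"
fi
```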

source=ggml.go:84 msg="ggml backend load all from path" path=/data/ollama-brucemacd-ctx-shift-err/build/lib/ollama ggml_backend_load_best: /data/ollama-brucemacd-ctx-shift-err/build/lib/ollama/libggml-cpu-sandybridge.so score: 20 ggml_backend_load_best: /data/ollama-brucemacd-ctx-shift-err/build/lib/ollama/libggml-cpu-haswell.so score: 55 ggml_backend_load_best: /data/ollama-brucemacd-ctx-shift-err/build/lib/ollama/libggml-cpu-icelake.so score: 0 ggml_backend_load_best: /data/ollama-brucemacd-ctx-shift-err/build/lib/ollama/libggml-cpu-alderlake.so score: 0 ggml_backend_load_best: /data/ollama-brucemacd-ctx-shift-err/build/lib/ollama/libggml-cpu-skylakex.so score: 183 load_backend: loaded CPU backend from /data/ollama-brucemacd-ctx-shift-err/build/lib/ollama/libggml-cpu-skylakex.so time=2025-03-05T02:28:05.510Z level=INFO source=runner.go:941 msg=system info="CPU : LLAMAFILE = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | LLAMAFILE = 1 | cgo(gcc)" threads=32 time=2025-03-05T02:28:05.510Z level=INFO source=runner.go:999 msg="Server listening on 127.0.0.1:42115" llama_model_loader: loaded meta data with 34 key-value pairs and 771 tensors from /root/.ollama/models/blobs/sha256-eabc98a9bcbfce7fd70f3e07de599f8fda98120fefed5881934161ede8bd1a41 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen2 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Qwen2.5 32B Instruct llama_model_loader: - kv 3: general.finetune str = Instruct llama_model_loader: - kv 4: general.basename str = Qwen2.5 llama_model_loader: - kv 5: general.size_label str = 32B llama_model_loader: - kv 6: general.license str = apache-2.0 llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/Qwen2.5-3... 
llama_model_loader: - kv 8: general.base_model.count u32 = 1 llama_model_loader: - kv 9: general.base_model.0.name str = Qwen2.5 32B llama_model_loader: - kv 10: general.base_model.0.organization str = Qwen llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-32B llama_model_loader: - kv 12: general.tags arr[str,2] = ["chat", "text-generation"] llama_model_loader: - kv 13: general.languages arr[str,1] = ["en"] llama_model_loader: - kv 14: qwen2.block_count u32 = 64 llama_model_loader: - kv 15: qwen2.context_length u32 = 32768 llama_model_loader: - kv 16: qwen2.embedding_length u32 = 5120 llama_model_loader: - kv 17: qwen2.feed_forward_length u32 = 27648 llama_model_loader: - kv 18: qwen2.attention.head_count u32 = 40 llama_model_loader: - kv 19: qwen2.attention.head_count_kv u32 = 8 llama_model_loader: - kv 20: qwen2.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 21: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 22: general.file_type u32 = 15 llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 24: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 151643 llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 151643 llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = false llama_model_loader: - kv 32: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 
llama_model_loader: - kv 33: general.quantization_version u32 = 2 llama_model_loader: - type f32: 321 tensors llama_model_loader: - type q4_K: 385 tensors llama_model_loader: - type q6_K: 65 tensors print_info: file format = GGUF V3 (latest) print_info: file type = Q4_K - Medium print_info: file size = 18.48 GiB (4.85 BPW) init_tokenizer: initializing tokenizer for type 2 load: control token: 151660 '<|fim_middle|>' is not marked as EOG load: control token: 151659 '<|fim_prefix|>' is not marked as EOG load: control token: 151653 '<|vision_end|>' is not marked as EOG load: control token: 151648 '<|box_start|>' is not marked as EOG load: control token: 151646 '<|object_ref_start|>' is not marked as EOG load: control token: 151649 '<|box_end|>' is not marked as EOG load: control token: 151655 '<|image_pad|>' is not marked as EOG load: control token: 151651 '<|quad_end|>' is not marked as EOG load: control token: 151647 '<|object_ref_end|>' is not marked as EOG load: control token: 151652 '<|vision_start|>' is not marked as EOG load: control token: 151654 '<|vision_pad|>' is not marked as EOG load: control token: 151656 '<|video_pad|>' is not marked as EOG load: control token: 151644 '<|im_start|>' is not marked as EOG load: control token: 151661 '<|fim_suffix|>' is not marked as EOG load: control token: 151650 '<|quad_start|>' is not marked as EOG load: special tokens cache size = 22 time=2025-03-05T02:28:05.714Z level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model" load: token to piece cache size = 0.9310 MB print_info: arch = qwen2 print_info: vocab_only = 0 print_info: n_ctx_train = 32768 print_info: n_embd = 5120 print_info: n_layer = 64 print_info: n_head = 40 print_info: n_head_kv = 8 print_info: n_rot = 128 print_info: n_swa = 0 print_info: n_embd_head_k = 128 print_info: n_embd_head_v = 128 print_info: n_gqa = 5 print_info: n_embd_k_gqa = 1024 print_info: n_embd_v_gqa = 1024 print_info: f_norm_eps = 
0.0e+00 print_info: f_norm_rms_eps = 1.0e-06 print_info: f_clamp_kqv = 0.0e+00 print_info: f_max_alibi_bias = 0.0e+00 print_info: f_logit_scale = 0.0e+00 print_info: n_ff = 27648 print_info: n_expert = 0 print_info: n_expert_used = 0 print_info: causal attn = 1 print_info: pooling type = 0 print_info: rope type = 2 print_info: rope scaling = linear print_info: freq_base_train = 1000000.0 print_info: freq_scale_train = 1 print_info: n_ctx_orig_yarn = 32768 print_info: rope_finetuned = unknown print_info: ssm_d_conv = 0 print_info: ssm_d_inner = 0 print_info: ssm_d_state = 0 print_info: ssm_dt_rank = 0 print_info: ssm_dt_b_c_rms = 0 print_info: model type = 32B print_info: model params = 32.76 B print_info: general.name = Qwen2.5 32B Instruct print_info: vocab type = BPE print_info: n_vocab = 152064 print_info: n_merges = 151387 print_info: BOS token = 151643 '<|endoftext|>' print_info: EOS token = 151645 '<|im_end|>' print_info: EOT token = 151645 '<|im_end|>' print_info: PAD token = 151643 '<|endoftext|>' print_info: LF token = 198 'Ċ' print_info: FIM PRE token = 151659 '<|fim_prefix|>' print_info: FIM SUF token = 151661 '<|fim_suffix|>' print_info: FIM MID token = 151660 '<|fim_middle|>' print_info: FIM PAD token = 151662 '<|fim_pad|>' print_info: FIM REP token = 151663 '<|repo_name|>' print_info: FIM SEP token = 151664 '<|file_sep|>' print_info: EOG token = 151643 '<|endoftext|>' print_info: EOG token = 151645 '<|im_end|>' print_info: EOG token = 151662 '<|fim_pad|>' print_info: EOG token = 151663 '<|repo_name|>' print_info: EOG token = 151664 '<|file_sep|>' print_info: max token length = 256 load_tensors: loading model tensors, this can take a while... 
(mmap = true) load_tensors: layer 0 assigned to device CPU load_tensors: layer 1 assigned to device CPU load_tensors: layer 2 assigned to device CPU load_tensors: layer 3 assigned to device CPU load_tensors: layer 4 assigned to device CPU load_tensors: layer 5 assigned to device CPU load_tensors: layer 6 assigned to device CPU load_tensors: layer 7 assigned to device CPU load_tensors: layer 8 assigned to device CPU load_tensors: layer 9 assigned to device CPU load_tensors: layer 10 assigned to device CPU load_tensors: layer 11 assigned to device CPU load_tensors: layer 12 assigned to device CPU load_tensors: layer 13 assigned to device CPU load_tensors: layer 14 assigned to device CPU load_tensors: layer 15 assigned to device CPU load_tensors: layer 16 assigned to device CPU load_tensors: layer 17 assigned to device CPU load_tensors: layer 18 assigned to device CPU load_tensors: layer 19 assigned to device CPU load_tensors: layer 20 assigned to device CPU load_tensors: layer 21 assigned to device CPU load_tensors: layer 22 assigned to device CPU load_tensors: layer 23 assigned to device CPU load_tensors: layer 24 assigned to device CPU load_tensors: layer 25 assigned to device CPU load_tensors: layer 26 assigned to device CPU load_tensors: layer 27 assigned to device CPU load_tensors: layer 28 assigned to device CPU load_tensors: layer 29 assigned to device CPU load_tensors: layer 30 assigned to device CPU load_tensors: layer 31 assigned to device CPU load_tensors: layer 32 assigned to device CPU load_tensors: layer 33 assigned to device CPU load_tensors: layer 34 assigned to device CPU load_tensors: layer 35 assigned to device CPU load_tensors: layer 36 assigned to device CPU load_tensors: layer 37 assigned to device CPU load_tensors: layer 38 assigned to device CPU load_tensors: layer 39 assigned to device CPU load_tensors: layer 40 assigned to device CPU load_tensors: layer 41 assigned to device CPU load_tensors: layer 42 assigned to device CPU load_tensors: 
layer 43 assigned to device CPU load_tensors: layer 44 assigned to device CPU load_tensors: layer 45 assigned to device CPU load_tensors: layer 46 assigned to device CPU load_tensors: layer 47 assigned to device CPU load_tensors: layer 48 assigned to device CPU load_tensors: layer 49 assigned to device CPU load_tensors: layer 50 assigned to device CPU load_tensors: layer 51 assigned to device CPU load_tensors: layer 52 assigned to device CPU load_tensors: layer 53 assigned to device CPU load_tensors: layer 54 assigned to device CPU load_tensors: layer 55 assigned to device CPU load_tensors: layer 56 assigned to device CPU load_tensors: layer 57 assigned to device CPU load_tensors: layer 58 assigned to device CPU load_tensors: layer 59 assigned to device CPU load_tensors: layer 60 assigned to device CPU load_tensors: layer 61 assigned to device CPU load_tensors: layer 62 assigned to device CPU load_tensors: layer 63 assigned to device CPU load_tensors: layer 64 assigned to device CPU load_tensors: CPU_Mapped model buffer size = 18926.01 MiB ``` ### OS ubuntu 22.04 ### GPU root@b2b0c81a8393:/data/ollama-brucemacd-ctx-shift-err# nvidia-smi Wed Mar 5 03:40:07 2025 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |===============================+======================+======================| | 0 Tesla V100-PCIE... Off | 00000000:06:00.0 Off | 0 | | N/A 36C P0 25W / 250W | 4MiB / 32768MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 1 Tesla V100-PCIE... 
Off | 00000000:2F:00.0 Off | 0 | | N/A 40C P0 36W / 250W | 4380MiB / 32768MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 2 Tesla V100-PCIE... Off | 00000000:86:00.0 Off | 0 | | N/A 36C P0 25W / 250W | 4MiB / 32768MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 3 Tesla V100-PCIE... Off | 00000000:D8:00.0 Off | 0 | | N/A 33C P0 27W / 250W | 4MiB / 32768MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| +-----------------------------------------------------------------------------+ ### CPU Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz ### Ollama version I use a docker to build the environment spacewalkerjp/nvidia_cuda_11.8.0-cudnn8-runtime-ubuntu22.04_updated4automatic1111:latest
GiteaMirror added the bug label 2025-11-12 13:25:22 -06:00

@ZiZi1noob commented on GitHub (Mar 5, 2025):

Possibly the same issue here (I don't think I had this error on the previous version, but it has appeared since I updated Ollama last Thursday or Friday):

ollama version is 0.5.12

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2080 Ti On | 00000000:1A:00.0 Off | N/A |
| 28% 27C P8 21W / 250W| 3MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 2080 Ti On | 00000000:1B:00.0 Off | N/A |
| 27% 25C P8 16W / 250W| 3MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce RTX 2080 Ti On | 00000000:3D:00.0 Off | N/A |
| 26% 24C P8 6W / 250W| 3MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce RTX 2080 Ti On | 00000000:3E:00.0 Off | N/A |
| 27% 26C P8 18W / 250W| 3MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA GeForce RTX 2080 Ti On | 00000000:88:00.0 Off | N/A |
| 27% 25C P8 1W / 250W| 3MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA GeForce RTX 2080 Ti On | 00000000:89:00.0 Off | N/A |
| 28% 27C P8 7W / 250W| 3MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA GeForce RTX 2080 Ti On | 00000000:B1:00.0 Off | N/A |
| 27% 27C P8 1W / 250W| 3MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA GeForce RTX 2080 Ti On | 00000000:B2:00.0 Off | N/A |
| 28% 26C P8 9W / 250W| 3MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+

Mar 05 12:25:45 icrgpuserver1 ollama[27511]: time=2025-03-05T12:25:45.680+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-06ecf0c4-810d-f820-2ef1-b2a07fb32c77 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:25:45 icrgpuserver1 ollama[27511]: time=2025-03-05T12:25:45.865+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-6ba36800-e68f-5e54-5e83-cb8555a73cd8 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:25:46 icrgpuserver1 ollama[27511]: time=2025-03-05T12:25:46.053+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-6e857c0d-5334-0c6b-f29a-bd0e7c1e0c06 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:25:46 icrgpuserver1 ollama[27511]: time=2025-03-05T12:25:46.230+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-f23fc23e-840d-e9ba-5b1d-c7893c8dc1e1 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:25:46 icrgpuserver1 ollama[27511]: time=2025-03-05T12:25:46.410+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-e14d12f9-b786-b011-ac9f-43e06cbf71cf name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:25:46 icrgpuserver1 ollama[27511]: time=2025-03-05T12:25:46.587+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-a269d824-6458-5282-5c17-b6a61287ac7a name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:25:46 icrgpuserver1 ollama[27511]: time=2025-03-05T12:25:46.768+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-80feaeba-a5b6-0952-27d9-6cffdfcfa811 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:25:46 icrgpuserver1 ollama[27511]: time=2025-03-05T12:25:46.947+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-d08a0dfb-2cff-8067-3ca3-2e2577e76dfc name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:25:46 icrgpuserver1 ollama[27511]: releasing cuda driver library
Mar 05 12:25:46 icrgpuserver1 ollama[27511]: time=2025-03-05T12:25:46.947+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=9.39857387 model=/usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49
Mar 05 12:39:09 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:39:09 | 200 | 89.279µs | 127.0.0.1 | HEAD "/"
Mar 05 12:39:09 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:39:09 | 200 | 1.978093ms | 127.0.0.1 | GET "/api/tags"
Mar 05 12:40:07 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:07 | 200 | 81.048µs | 127.0.0.1 | HEAD "/"
Mar 05 12:40:07 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:07 | 200 | 2.825563ms | 127.0.0.1 | POST "/api/generate"
Mar 05 12:40:07 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:07 | 200 | 49.407868ms | 127.0.0.1 | DELETE "/api/delete"
Mar 05 12:40:10 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:10 | 200 | 64.733µs | 127.0.0.1 | HEAD "/"
Mar 05 12:40:10 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:10 | 200 | 1.46592ms | 127.0.0.1 | GET "/api/tags"
Mar 05 12:40:18 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:18 | 200 | 65.764µs | 127.0.0.1 | HEAD "/"
Mar 05 12:40:18 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:18 | 200 | 2.820681ms | 127.0.0.1 | POST "/api/generate"
Mar 05 12:40:18 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:18 | 500 | 1.686558ms | 127.0.0.1 | DELETE "/api/delete"
Mar 05 12:40:33 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:33 | 200 | 80.746µs | 127.0.0.1 | HEAD "/"
Mar 05 12:40:33 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:33 | 404 | 483.578µs | 127.0.0.1 | POST "/api/generate"
Mar 05 12:40:33 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:33 | 404 | 282.957µs | 127.0.0.1 | DELETE "/api/delete"
Mar 05 12:40:41 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:41 | 200 | 67.33µs | 127.0.0.1 | HEAD "/"
Mar 05 12:40:41 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:41 | 200 | 221.661µs | 127.0.0.1 | GET "/api/tags"
Mar 05 12:42:22 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:42:22 | 200 | 69.751µs | 127.0.0.1 | HEAD "/"
Mar 05 12:42:22 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:42:22 | 200 | 230.367µs | 127.0.0.1 | GET "/api/tags"
Mar 05 12:43:00 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:43:00 | 200 | 70.833µs | 127.0.0.1 | HEAD "/"
Mar 05 12:43:00 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:43:00 | 404 | 432.174µs | 127.0.0.1 | POST "/api/show"
Mar 05 12:43:01 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:43:01 | 200 | 1.093714765s | 127.0.0.1 | POST "/api/pull"
Mar 05 12:43:01 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:43:01 | 200 | 94.582532ms | 127.0.0.1 | POST "/api/show"
Mar 05 12:43:01 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:01.897+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="376.6 GiB" before.free="369.6 GiB" before.free_swap="1.9 GiB" now.total="376.6 GiB" now.free="369.6 GiB" now.free_swap="1.9 GiB"
Mar 05 12:43:01 icrgpuserver1 ollama[27511]: initializing /usr/lib/x86_64-linux-gnu/libcuda.so.530.30.02
Mar 05 12:43:01 icrgpuserver1 ollama[27511]: dlsym: cuInit - 0x7f2ca25bda40
Mar 05 12:43:01 icrgpuserver1 ollama[27511]: dlsym: cuDriverGetVersion - 0x7f2ca25bda60
Mar 05 12:43:01 icrgpuserver1 ollama[27511]: dlsym: cuDeviceGetCount - 0x7f2ca25bdaa0
Mar 05 12:43:01 icrgpuserver1 ollama[27511]: dlsym: cuDeviceGet - 0x7f2ca25bda80
Mar 05 12:43:01 icrgpuserver1 ollama[27511]: dlsym: cuDeviceGetAttribute - 0x7f2ca25bdb80
Mar 05 12:43:01 icrgpuserver1 ollama[27511]: dlsym: cuDeviceGetUuid - 0x7f2ca25bdae0
Mar 05 12:43:01 icrgpuserver1 ollama[27511]: dlsym: cuDeviceGetName - 0x7f2ca25bdac0
Mar 05 12:43:01 icrgpuserver1 ollama[27511]: dlsym: cuCtxCreate_v3 - 0x7f2ca25c5030
Mar 05 12:43:01 icrgpuserver1 ollama[27511]: dlsym: cuMemGetInfo_v2 - 0x7f2ca25c5670
Mar 05 12:43:01 icrgpuserver1 ollama[27511]: dlsym: cuCtxDestroy - 0x7f2ca2610510
Mar 05 12:43:01 icrgpuserver1 ollama[27511]: calling cuInit
Mar 05 12:43:01 icrgpuserver1 ollama[27511]: calling cuDriverGetVersion
Mar 05 12:43:01 icrgpuserver1 ollama[27511]: raw version 0x2eea
Mar 05 12:43:01 icrgpuserver1 ollama[27511]: CUDA driver version: 12.1
Mar 05 12:43:01 icrgpuserver1 ollama[27511]: calling cuDeviceGetCount
Mar 05 12:43:01 icrgpuserver1 ollama[27511]: device count 8
Mar 05 12:43:02 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:02.098+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-06ecf0c4-810d-f820-2ef1-b2a07fb32c77 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:43:02 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:02.310+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-6ba36800-e68f-5e54-5e83-cb8555a73cd8 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:43:02 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:02.510+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-6e857c0d-5334-0c6b-f29a-bd0e7c1e0c06 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:43:02 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:02.714+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-f23fc23e-840d-e9ba-5b1d-c7893c8dc1e1 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:43:02 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:02.926+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-e14d12f9-b786-b011-ac9f-43e06cbf71cf name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:03.133+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-a269d824-6458-5282-5c17-b6a61287ac7a name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:03.340+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-80feaeba-a5b6-0952-27d9-6cffdfcfa811 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:03.542+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-d08a0dfb-2cff-8067-3ca3-2e2577e76dfc name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: releasing cuda driver library
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:03.710+08:00 level=DEBUG source=sched.go:225 msg="loading first model" model=/usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:03.710+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[10.6 GiB]"
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:03.711+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.key_length default=128
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:03.711+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.value_length default=128
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:03.711+08:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49 gpu=GPU-06ecf0c4-810d-f820-2ef1-b2a07fb32c77 parallel=4 available=11380850688 required="5.6 GiB"
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:03.711+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="376.6 GiB" before.free="369.6 GiB" before.free_swap="1.9 GiB" now.total="376.6 GiB" now.free="369.5 GiB" now.free_swap="1.9 GiB"
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: initializing /usr/lib/x86_64-linux-gnu/libcuda.so.530.30.02
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: dlsym: cuInit - 0x7f2ca25bda40
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: dlsym: cuDriverGetVersion - 0x7f2ca25bda60
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: dlsym: cuDeviceGetCount - 0x7f2ca25bdaa0
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: dlsym: cuDeviceGet - 0x7f2ca25bda80
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: dlsym: cuDeviceGetAttribute - 0x7f2ca25bdb80
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: dlsym: cuDeviceGetUuid - 0x7f2ca25bdae0
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: dlsym: cuDeviceGetName - 0x7f2ca25bdac0
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: dlsym: cuCtxCreate_v3 - 0x7f2ca25c5030
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: dlsym: cuMemGetInfo_v2 - 0x7f2ca25c5670
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: dlsym: cuCtxDestroy - 0x7f2ca2610510
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: calling cuInit
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: calling cuDriverGetVersion
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: raw version 0x2eea
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: CUDA driver version: 12.1
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: calling cuDeviceGetCount
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: device count 8
Mar 05 12:43:03 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:03.906+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-06ecf0c4-810d-f820-2ef1-b2a07fb32c77 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:43:04 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:04.148+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-6ba36800-e68f-5e54-5e83-cb8555a73cd8 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:43:04 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:04.333+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-6e857c0d-5334-0c6b-f29a-bd0e7c1e0c06 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:43:04 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:04.560+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-f23fc23e-840d-e9ba-5b1d-c7893c8dc1e1 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:43:04 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:04.774+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-e14d12f9-b786-b011-ac9f-43e06cbf71cf name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:43:04 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:04.987+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-a269d824-6458-5282-5c17-b6a61287ac7a name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.216+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-80feaeba-a5b6-0952-27d9-6cffdfcfa811 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.440+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-d08a0dfb-2cff-8067-3ca3-2e2577e76dfc name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: releasing cuda driver library
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.440+08:00 level=INFO source=server.go:97 msg="system memory" total="376.6 GiB" free="369.5 GiB" free_swap="1.9 GiB"
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.440+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[10.6 GiB]"
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.441+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.key_length default=128
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.441+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.value_length default=128
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.441+08:00 level=INFO source=server.go:130 msg=offload library=cuda layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[10.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.6 GiB" memory.required.partial="5.6 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[5.6 GiB]" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB"
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.441+08:00 level=DEBUG source=server.go:259 msg="compatible gpu libraries" compatible=[cuda_v11]
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.442+08:00 level=DEBUG source=server.go:302 msg="adding gpu library" path=/usr/local/lib/ollama/cuda_v11
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.442+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49 --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 --verbose --threads 16 --parallel 4 --port 44263"
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.442+08:00 level=DEBUG source=server.go:398 msg=subprocess environment="[PATH=/home/ziyangzhan/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin CUDA_VISIBLE_DEVICES=GPU-06ecf0c4-810d-f820-2ef1-b2a07fb32c77 LD_LIBRARY_PATH=/usr/local/lib/ollama/cuda_v11:/usr/local/lib/ollama]"
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.443+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.443+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.443+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.505+08:00 level=INFO source=runner.go:932 msg="starting go runner"
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.505+08:00 level=DEBUG source=ggml.go:89 msg="ggml backend load all from path" path=/usr/local/lib/ollama/cuda_v11
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.509+08:00 level=DEBUG source=ggml.go:89 msg="ggml backend load all from path" path=/usr/local/lib/ollama
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.510+08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(gcc)" threads=16
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.510+08:00 level=INFO source=runner.go:993 msg="Server listening on 127.0.0.1:44263"
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49 (version GGUF V3 (latest))
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 0: general.architecture str = qwen2
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 1: general.type str = model
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Qwen 7B
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 3: general.basename str = DeepSeek-R1-Distill-Qwen
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 4: general.size_label str = 7B
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 5: qwen2.block_count u32 = 28
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 6: qwen2.context_length u32 = 131072
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 7: qwen2.embedding_length u32 = 3584
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 8: qwen2.feed_forward_length u32 = 18944
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 9: qwen2.attention.head_count u32 = 28
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 10: qwen2.attention.head_count_kv u32 = 4
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 11: qwen2.rope.freq_base f32 = 10000.000000
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 12: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 13: general.file_type u32 = 15
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 15: tokenizer.ggml.pre str = qwen2
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.695+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 151646
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151643
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 24: tokenizer.chat_template str = {% if not add_generation_prompt is de...
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 25: general.quantization_version u32 = 2
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - type f32: 141 tensors
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - type q4_K: 169 tensors
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - type q6_K: 29 tensors
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151660 '<|fim_middle|>' is not marked as EOG
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151659 '<|fim_prefix|>' is not marked as EOG
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151653 '<|vision_end|>' is not marked as EOG
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151645 '<|Assistant|>' is not marked as EOG
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151644 '<|User|>' is not marked as EOG
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151655 '<|image_pad|>' is not marked as EOG
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151651 '<|quad_end|>' is not marked as EOG
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151652 '<|vision_start|>' is not marked as EOG
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151647 '<|EOT|>' is not marked as EOG
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151654 '<|vision_pad|>' is not marked as EOG
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151656 '<|video_pad|>' is not marked as EOG
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151661 '<|fim_suffix|>' is not marked as EOG
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151650 '<|quad_start|>' is not marked as EOG
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: special tokens cache size = 22
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_vocab: token to piece cache size = 0.9310 MB
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: format = GGUF V3 (latest)
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: arch = qwen2
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: vocab type = BPE
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_vocab = 152064
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_merges = 151387
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: vocab_only = 0
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_ctx_train = 131072
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_embd = 3584
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_layer = 28
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_head = 28
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_head_kv = 4
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_rot = 128
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_swa = 0
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_embd_head_k = 128
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_embd_head_v = 128
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_gqa = 7
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_embd_k_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: f_norm_eps = 0.0e+00
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: f_logit_scale = 0.0e+00
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_ff = 18944
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_expert = 0
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_expert_used = 0
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: causal attn = 1
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: pooling type = 0
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: rope type = 2
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: rope scaling = linear
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: freq_base_train = 10000.0
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: freq_scale_train = 1
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_ctx_orig_yarn = 131072
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: rope_finetuned = unknown
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: ssm_d_conv = 0
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: ssm_d_inner = 0
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: ssm_d_state = 0
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: ssm_dt_rank = 0
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: ssm_dt_b_c_rms = 0
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: model type = 7B
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: model ftype = Q4_K - Medium
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: model params = 7.62 B
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: model size = 4.36 GiB (4.91 BPW)
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: general.name = DeepSeek R1 Distill Qwen 7B
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: BOS token = 151646 '<|begin▁of▁sentence|>'
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: EOS token = 151643 '<|end▁of▁sentence|>'
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: EOT token = 151643 '<|end▁of▁sentence|>'
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: PAD token = 151643 '<|end▁of▁sentence|>'
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: LF token = 148848 'ÄĬ'
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: EOG token = 151643 '<|end▁of▁sentence|>'
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: max token length = 256
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_tensors: CPU_Mapped model buffer size = 4460.45 MiB
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_new_context_with_model: n_seq_max = 4
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_new_context_with_model: n_ctx = 8192
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_new_context_with_model: n_ctx_per_seq = 2048
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_new_context_with_model: n_batch = 2048
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_new_context_with_model: n_ubatch = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_new_context_with_model: flash_attn = 0
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_new_context_with_model: freq_base = 10000.0
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_new_context_with_model: freq_scale = 1
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 0: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 1: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 2: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 3: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 4: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 5: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 6: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 7: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 8: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 9: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 10: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 11: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 12: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 13: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 14: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 15: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 16: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 17: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 18: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 19: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 20: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 21: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 22: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 23: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 24: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 25: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 26: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 27: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:07.959+08:00 level=DEBUG source=server.go:602 msg="model load progress 1.00"
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: llama_kv_cache_init: CPU KV buffer size = 448.00 MiB
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: llama_new_context_with_model: KV self size = 448.00 MiB, K (f16): 224.00 MiB, V (f16): 224.00 MiB
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: llama_new_context_with_model: CPU output buffer size = 2.38 MiB
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: llama_new_context_with_model: CPU compute buffer size = 492.01 MiB
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: llama_new_context_with_model: graph nodes = 986
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: llama_new_context_with_model: graph splits = 1
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:08.209+08:00 level=INFO source=server.go:596 msg="llama runner started in 2.77 seconds"
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:08.210+08:00 level=DEBUG source=sched.go:463 msg="finished setting up runner" model=/usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:43:08 | 200 | 6.402278357s | 127.0.0.1 | POST "/api/generate"
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:08.210+08:00 level=DEBUG source=sched.go:467 msg="context for request finished"
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:08.210+08:00 level=DEBUG source=sched.go:340 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49 duration=5m0s
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:08.210+08:00 level=DEBUG source=sched.go:358 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49 refCount=0
Mar 05 12:44:01 icrgpuserver1 ollama[27511]: time=2025-03-05T12:44:01.515+08:00 level=DEBUG source=sched.go:576 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49
Mar 05 12:44:01 icrgpuserver1 ollama[27511]: time=2025-03-05T12:44:01.515+08:00 level=DEBUG source=routes.go:1480 msg="chat request" images=0 prompt="<|User|>how to become more lucky?<|Assistant|>"
Mar 05 12:44:01 icrgpuserver1 ollama[27511]: time=2025-03-05T12:44:01.531+08:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=9 used=0 remaining=9
Mar 05 12:44:45 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:44:45 | 200 | 42.038µs | 127.0.0.1 | HEAD "/"
Mar 05 12:44:45 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:44:45 | 200 | 150.666µs | 127.0.0.1 | GET "/api/ps"
Mar 05 12:52:38 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:52:38 | 200 | 61.208µs | 127.0.0.1 | GET "/api/version"
Mar 05 12:54:12 icrgpuserver1 ollama[27511]: time=2025-03-05T12:54:12.456+08:00 level=DEBUG source=sched.go:408 msg="context for request finished"
Mar 05 12:54:12 icrgpuserver1 ollama[27511]: time=2025-03-05T12:54:12.456+08:00 level=DEBUG source=sched.go:340 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49 duration=5m0s
Mar 05 12:54:12 icrgpuserver1 ollama[27511]: time=2025-03-05T12:54:12.456+08:00 level=DEBUG source=sched.go:358 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49 refCount=0
Mar 05 12:54:12 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:54:12 | 200 | 10m11s | 127.0.0.1 | POST "/api/chat"

@ZiZi1noob commented on GitHub (Mar 5, 2025): Maybe the same issue here. (I think I did not have this error in the previous version, but it has shown up since I updated Ollama last Thursday or Friday.) ollama version is 0.5.12

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     On  | 00000000:1A:00.0 Off |                  N/A |
| 28%   27C    P8               21W / 250W|      3MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti     On  | 00000000:1B:00.0 Off |                  N/A |
| 27%   25C    P8               16W / 250W|      3MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 2080 Ti     On  | 00000000:3D:00.0 Off |                  N/A |
| 26%   24C    P8                6W / 250W|      3MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 2080 Ti     On  | 00000000:3E:00.0 Off |                  N/A |
| 27%   26C    P8               18W / 250W|      3MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce RTX 2080 Ti     On  | 00000000:88:00.0 Off |                  N/A |
| 27%   25C    P8                1W / 250W|      3MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce RTX 2080 Ti     On  | 00000000:89:00.0 Off |                  N/A |
| 28%   27C    P8                7W / 250W|      3MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce RTX 2080 Ti     On  | 00000000:B1:00.0 Off |                  N/A |
| 27%   27C    P8                1W / 250W|      3MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce RTX 2080 Ti     On  | 00000000:B2:00.0 Off |                  N/A |
| 28%   26C    P8                9W / 250W|      3MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Mar 05 12:25:45 icrgpuserver1 ollama[27511]: time=2025-03-05T12:25:45.680+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-06ecf0c4-810d-f820-2ef1-b2a07fb32c77 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:25:45 icrgpuserver1 ollama[27511]: time=2025-03-05T12:25:45.865+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-6ba36800-e68f-5e54-5e83-cb8555a73cd8 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:25:46 icrgpuserver1 ollama[27511]: time=2025-03-05T12:25:46.053+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-6e857c0d-5334-0c6b-f29a-bd0e7c1e0c06 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:25:46 icrgpuserver1 ollama[27511]: time=2025-03-05T12:25:46.230+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-f23fc23e-840d-e9ba-5b1d-c7893c8dc1e1 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:25:46 icrgpuserver1 ollama[27511]: time=2025-03-05T12:25:46.410+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-e14d12f9-b786-b011-ac9f-43e06cbf71cf name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:25:46 icrgpuserver1 ollama[27511]: time=2025-03-05T12:25:46.587+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-a269d824-6458-5282-5c17-b6a61287ac7a name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:25:46 icrgpuserver1 ollama[27511]: time=2025-03-05T12:25:46.768+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-80feaeba-a5b6-0952-27d9-6cffdfcfa811 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:25:46 icrgpuserver1 ollama[27511]: time=2025-03-05T12:25:46.947+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-d08a0dfb-2cff-8067-3ca3-2e2577e76dfc name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB"
Mar 05 12:25:46 icrgpuserver1 ollama[27511]: releasing cuda driver library
Mar 05 12:25:46 icrgpuserver1 ollama[27511]: time=2025-03-05T12:25:46.947+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=9.39857387 model=/usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49
Mar 05 12:39:09 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:39:09 | 200 | 89.279µs | 127.0.0.1 | HEAD "/"
Mar 05 12:39:09 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:39:09 | 200 | 1.978093ms | 127.0.0.1 | GET "/api/tags"
Mar 05 12:40:07 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:07 | 200 | 81.048µs | 127.0.0.1 | HEAD "/"
Mar 05 12:40:07 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:07 | 200 | 2.825563ms | 127.0.0.1 | POST "/api/generate"
Mar 05 12:40:07 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:07 | 200 | 49.407868ms | 127.0.0.1 | DELETE "/api/delete"
Mar 05 12:40:10 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:10 | 200 | 64.733µs | 127.0.0.1 | HEAD "/"
Mar 05 12:40:10 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:10 | 200 | 1.46592ms | 127.0.0.1 | GET "/api/tags"
Mar 05 12:40:18 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:18 | 200 | 65.764µs | 127.0.0.1 | HEAD "/"
Mar 05 12:40:18 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:18 | 200 | 2.820681ms | 127.0.0.1 | POST "/api/generate"
Mar 05 12:40:18 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:18 | 500 | 1.686558ms | 127.0.0.1 | DELETE "/api/delete"
Mar 05 12:40:33 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:33 | 200 | 80.746µs | 127.0.0.1 | HEAD "/"
Mar 05 12:40:33 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:33 | 404 | 483.578µs | 127.0.0.1 | POST "/api/generate"
Mar 05 12:40:33 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:33 | 404 | 282.957µs | 127.0.0.1 | DELETE "/api/delete"
Mar 05 12:40:41 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:41 | 200 | 67.33µs | 127.0.0.1 | HEAD "/"
Mar 05 12:40:41 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:40:41 | 200 | 221.661µs | 127.0.0.1 | GET "/api/tags"
Mar 05 12:42:22 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:42:22 | 200 |
69.751µs | 127.0.0.1 | HEAD "/" Mar 05 12:42:22 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:42:22 | 200 | 230.367µs | 127.0.0.1 | GET "/api/tags" Mar 05 12:43:00 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:43:00 | 200 | 70.833µs | 127.0.0.1 | HEAD "/" Mar 05 12:43:00 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:43:00 | 404 | 432.174µs | 127.0.0.1 | POST "/api/show" Mar 05 12:43:01 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:43:01 | 200 | 1.093714765s | 127.0.0.1 | POST "/api/pull" Mar 05 12:43:01 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:43:01 | 200 | 94.582532ms | 127.0.0.1 | POST "/api/show" Mar 05 12:43:01 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:01.897+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="376.6 GiB" before.free="369.6 GiB" before.free_swap="1.9 GiB" now.total="376.6 GiB" now.free="369.6 GiB" now.free_swap="1.9 GiB" Mar 05 12:43:01 icrgpuserver1 ollama[27511]: initializing /usr/lib/x86_64-linux-gnu/libcuda.so.530.30.02 Mar 05 12:43:01 icrgpuserver1 ollama[27511]: dlsym: cuInit - 0x7f2ca25bda40 Mar 05 12:43:01 icrgpuserver1 ollama[27511]: dlsym: cuDriverGetVersion - 0x7f2ca25bda60 Mar 05 12:43:01 icrgpuserver1 ollama[27511]: dlsym: cuDeviceGetCount - 0x7f2ca25bdaa0 Mar 05 12:43:01 icrgpuserver1 ollama[27511]: dlsym: cuDeviceGet - 0x7f2ca25bda80 Mar 05 12:43:01 icrgpuserver1 ollama[27511]: dlsym: cuDeviceGetAttribute - 0x7f2ca25bdb80 Mar 05 12:43:01 icrgpuserver1 ollama[27511]: dlsym: cuDeviceGetUuid - 0x7f2ca25bdae0 Mar 05 12:43:01 icrgpuserver1 ollama[27511]: dlsym: cuDeviceGetName - 0x7f2ca25bdac0 Mar 05 12:43:01 icrgpuserver1 ollama[27511]: dlsym: cuCtxCreate_v3 - 0x7f2ca25c5030 Mar 05 12:43:01 icrgpuserver1 ollama[27511]: dlsym: cuMemGetInfo_v2 - 0x7f2ca25c5670 Mar 05 12:43:01 icrgpuserver1 ollama[27511]: dlsym: cuCtxDestroy - 0x7f2ca2610510 Mar 05 12:43:01 icrgpuserver1 ollama[27511]: calling cuInit Mar 05 12:43:01 icrgpuserver1 ollama[27511]: calling 
cuDriverGetVersion Mar 05 12:43:01 icrgpuserver1 ollama[27511]: raw version 0x2eea Mar 05 12:43:01 icrgpuserver1 ollama[27511]: CUDA driver version: 12.1 Mar 05 12:43:01 icrgpuserver1 ollama[27511]: calling cuDeviceGetCount Mar 05 12:43:01 icrgpuserver1 ollama[27511]: device count 8 Mar 05 12:43:02 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:02.098+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-06ecf0c4-810d-f820-2ef1-b2a07fb32c77 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB" Mar 05 12:43:02 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:02.310+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-6ba36800-e68f-5e54-5e83-cb8555a73cd8 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB" Mar 05 12:43:02 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:02.510+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-6e857c0d-5334-0c6b-f29a-bd0e7c1e0c06 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB" Mar 05 12:43:02 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:02.714+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-f23fc23e-840d-e9ba-5b1d-c7893c8dc1e1 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB" Mar 05 12:43:02 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:02.926+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-e14d12f9-b786-b011-ac9f-43e06cbf71cf name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" 
now.used="157.9 MiB" Mar 05 12:43:03 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:03.133+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-a269d824-6458-5282-5c17-b6a61287ac7a name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB" Mar 05 12:43:03 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:03.340+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-80feaeba-a5b6-0952-27d9-6cffdfcfa811 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB" Mar 05 12:43:03 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:03.542+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-d08a0dfb-2cff-8067-3ca3-2e2577e76dfc name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB" Mar 05 12:43:03 icrgpuserver1 ollama[27511]: releasing cuda driver library Mar 05 12:43:03 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:03.710+08:00 level=DEBUG source=sched.go:225 msg="loading first model" model=/usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49 Mar 05 12:43:03 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:03.710+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[10.6 GiB]" Mar 05 12:43:03 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:03.711+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.key_length default=128 Mar 05 12:43:03 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:03.711+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.value_length default=128 Mar 05 12:43:03 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:03.711+08:00 
level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49 gpu=GPU-06ecf0c4-810d-f820-2ef1-b2a07fb32c77 parallel=4 available=11380850688 required="5.6 GiB" Mar 05 12:43:03 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:03.711+08:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="376.6 GiB" before.free="369.6 GiB" before.free_swap="1.9 GiB" now.total="376.6 GiB" now.free="369.5 GiB" now.free_swap="1.9 GiB" Mar 05 12:43:03 icrgpuserver1 ollama[27511]: initializing /usr/lib/x86_64-linux-gnu/libcuda.so.530.30.02 Mar 05 12:43:03 icrgpuserver1 ollama[27511]: dlsym: cuInit - 0x7f2ca25bda40 Mar 05 12:43:03 icrgpuserver1 ollama[27511]: dlsym: cuDriverGetVersion - 0x7f2ca25bda60 Mar 05 12:43:03 icrgpuserver1 ollama[27511]: dlsym: cuDeviceGetCount - 0x7f2ca25bdaa0 Mar 05 12:43:03 icrgpuserver1 ollama[27511]: dlsym: cuDeviceGet - 0x7f2ca25bda80 Mar 05 12:43:03 icrgpuserver1 ollama[27511]: dlsym: cuDeviceGetAttribute - 0x7f2ca25bdb80 Mar 05 12:43:03 icrgpuserver1 ollama[27511]: dlsym: cuDeviceGetUuid - 0x7f2ca25bdae0 Mar 05 12:43:03 icrgpuserver1 ollama[27511]: dlsym: cuDeviceGetName - 0x7f2ca25bdac0 Mar 05 12:43:03 icrgpuserver1 ollama[27511]: dlsym: cuCtxCreate_v3 - 0x7f2ca25c5030 Mar 05 12:43:03 icrgpuserver1 ollama[27511]: dlsym: cuMemGetInfo_v2 - 0x7f2ca25c5670 Mar 05 12:43:03 icrgpuserver1 ollama[27511]: dlsym: cuCtxDestroy - 0x7f2ca2610510 Mar 05 12:43:03 icrgpuserver1 ollama[27511]: calling cuInit Mar 05 12:43:03 icrgpuserver1 ollama[27511]: calling cuDriverGetVersion Mar 05 12:43:03 icrgpuserver1 ollama[27511]: raw version 0x2eea Mar 05 12:43:03 icrgpuserver1 ollama[27511]: CUDA driver version: 12.1 Mar 05 12:43:03 icrgpuserver1 ollama[27511]: calling cuDeviceGetCount Mar 05 12:43:03 icrgpuserver1 ollama[27511]: device count 8 Mar 05 12:43:03 icrgpuserver1 ollama[27511]: 
time=2025-03-05T12:43:03.906+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-06ecf0c4-810d-f820-2ef1-b2a07fb32c77 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB" Mar 05 12:43:04 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:04.148+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-6ba36800-e68f-5e54-5e83-cb8555a73cd8 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB" Mar 05 12:43:04 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:04.333+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-6e857c0d-5334-0c6b-f29a-bd0e7c1e0c06 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB" Mar 05 12:43:04 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:04.560+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-f23fc23e-840d-e9ba-5b1d-c7893c8dc1e1 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB" Mar 05 12:43:04 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:04.774+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-e14d12f9-b786-b011-ac9f-43e06cbf71cf name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB" Mar 05 12:43:04 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:04.987+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-a269d824-6458-5282-5c17-b6a61287ac7a name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 
GiB" now.used="157.9 MiB" Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.216+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-80feaeba-a5b6-0952-27d9-6cffdfcfa811 name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB" Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.440+08:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-d08a0dfb-2cff-8067-3ca3-2e2577e76dfc name="NVIDIA GeForce RTX 2080 Ti" overhead="0 B" before.total="10.8 GiB" before.free="10.6 GiB" now.total="10.8 GiB" now.free="10.6 GiB" now.used="157.9 MiB" Mar 05 12:43:05 icrgpuserver1 ollama[27511]: releasing cuda driver library Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.440+08:00 level=INFO source=server.go:97 msg="system memory" total="376.6 GiB" free="369.5 GiB" free_swap="1.9 GiB" Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.440+08:00 level=DEBUG source=memory.go:108 msg=evaluating library=cuda gpu_count=1 available="[10.6 GiB]" Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.441+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.key_length default=128 Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.441+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.value_length default=128 Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.441+08:00 level=INFO source=server.go:130 msg=offload library=cuda layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[10.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.6 GiB" memory.required.partial="5.6 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[5.6 GiB]" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" 
memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB" Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.441+08:00 level=DEBUG source=server.go:259 msg="compatible gpu libraries" compatible=[cuda_v11] Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.442+08:00 level=DEBUG source=server.go:302 msg="adding gpu library" path=/usr/local/lib/ollama/cuda_v11 Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.442+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49 --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 --verbose --threads 16 --parallel 4 --port 44263" Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.442+08:00 level=DEBUG source=server.go:398 msg=subprocess environment="[PATH=/home/ziyangzhan/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin CUDA_VISIBLE_DEVICES=GPU-06ecf0c4-810d-f820-2ef1-b2a07fb32c77 LD_LIBRARY_PATH=/usr/local/lib/ollama/cuda_v11:/usr/local/lib/ollama]" Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.443+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1 Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.443+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding" Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.443+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error" Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.505+08:00 level=INFO source=runner.go:932 msg="starting go runner" Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.505+08:00 level=DEBUG source=ggml.go:89 msg="ggml backend load 
all from path" path=/usr/local/lib/ollama/cuda_v11 Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.509+08:00 level=DEBUG source=ggml.go:89 msg="ggml backend load all from path" path=/usr/local/lib/ollama Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.510+08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(gcc)" threads=16 Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.510+08:00 level=INFO source=runner.go:993 msg="Server listening on 127.0.0.1:44263" Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49 (version GGUF V3 (latest)) Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 0: general.architecture str = qwen2 Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 1: general.type str = model Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Qwen 7B Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 3: general.basename str = DeepSeek-R1-Distill-Qwen Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 4: general.size_label str = 7B Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 5: qwen2.block_count u32 = 28 Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 6: qwen2.context_length u32 = 131072 Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 7: qwen2.embedding_length u32 = 3584 Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 8: qwen2.feed_forward_length u32 = 18944 Mar 05 12:43:05 icrgpuserver1 ollama[27511]: 
llama_model_loader: - kv 9: qwen2.attention.head_count u32 = 28 Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 10: qwen2.attention.head_count_kv u32 = 4 Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 11: qwen2.rope.freq_base f32 = 10000.000000 Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 12: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 13: general.file_type u32 = 15 Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2 Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 15: tokenizer.ggml.pre str = qwen2 Mar 05 12:43:05 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:05.695+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model" Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ... Mar 05 12:43:05 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... 
Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 151646 Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151643 Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643 Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 24: tokenizer.chat_template str = {% if not add_generation_prompt is de... Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - kv 25: general.quantization_version u32 = 2 Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - type f32: 141 tensors Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - type q4_K: 169 tensors Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llama_model_loader: - type q6_K: 29 tensors Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151660 '<|fim_middle|>' is not marked as EOG Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151659 '<|fim_prefix|>' is not marked as EOG Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151653 '<|vision_end|>' is not marked as EOG Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151645 '<|Assistant|>' is not marked as EOG Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151644 '<|User|>' is not marked as EOG Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151655 '<|image_pad|>' is not marked as EOG Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151651 '<|quad_end|>' is not marked as EOG Mar 05 12:43:06 icrgpuserver1 ollama[27511]: 
llm_load_vocab: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151652 '<|vision_start|>' is not marked as EOG Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151647 '<|EOT|>' is not marked as EOG Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151654 '<|vision_pad|>' is not marked as EOG Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151656 '<|video_pad|>' is not marked as EOG Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151661 '<|fim_suffix|>' is not marked as EOG Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: control token: 151650 '<|quad_start|>' is not marked as EOG Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect Mar 05 12:43:06 icrgpuserver1 ollama[27511]: llm_load_vocab: special tokens cache size = 22 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_vocab: token to piece cache size = 0.9310 MB Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: format = GGUF V3 (latest) Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: arch = qwen2 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: vocab type = BPE Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_vocab = 152064 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_merges = 151387 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: vocab_only = 0 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_ctx_train = 131072 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_embd = 3584 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: 
n_layer = 28 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_head = 28 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_head_kv = 4 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_rot = 128 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_swa = 0 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_embd_head_k = 128 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_embd_head_v = 128 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_gqa = 7 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_embd_k_gqa = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_embd_v_gqa = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: f_norm_eps = 0.0e+00 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: f_clamp_kqv = 0.0e+00 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: f_logit_scale = 0.0e+00 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_ff = 18944 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_expert = 0 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: n_expert_used = 0 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: causal attn = 1 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: pooling type = 0 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: rope type = 2 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: rope scaling = linear Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: freq_base_train = 10000.0 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: freq_scale_train = 1 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: 
llm_load_print_meta: n_ctx_orig_yarn = 131072 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: rope_finetuned = unknown Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: ssm_d_conv = 0 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: ssm_d_inner = 0 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: ssm_d_state = 0 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: ssm_dt_rank = 0 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: ssm_dt_b_c_rms = 0 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: model type = 7B Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: model ftype = Q4_K - Medium Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: model params = 7.62 B Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: model size = 4.36 GiB (4.91 BPW) Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: general.name = DeepSeek R1 Distill Qwen 7B Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: BOS token = 151646 '<|begin▁of▁sentence|>' Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: EOS token = 151643 '<|end▁of▁sentence|>' Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: EOT token = 151643 '<|end▁of▁sentence|>' Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: PAD token = 151643 '<|end▁of▁sentence|>' Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: LF token = 148848 'ÄĬ' Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>' Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>' Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>' Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>' Mar 05 12:43:07 icrgpuserver1 
ollama[27511]: llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>' Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>' Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: EOG token = 151643 '<|end▁of▁sentence|>' Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: EOG token = 151662 '<|fim_pad|>' Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: EOG token = 151663 '<|repo_name|>' Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: EOG token = 151664 '<|file_sep|>' Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_print_meta: max token length = 256 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llm_load_tensors: CPU_Mapped model buffer size = 4460.45 MiB Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_new_context_with_model: n_seq_max = 4 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_new_context_with_model: n_ctx = 8192 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_new_context_with_model: n_ctx_per_seq = 2048 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_new_context_with_model: n_batch = 2048 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_new_context_with_model: n_ubatch = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_new_context_with_model: flash_attn = 0 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_new_context_with_model: freq_base = 10000.0 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_new_context_with_model: freq_scale = 1 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 0: n_embd_k_gqa = 512, n_embd_v_gqa = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: 
llama_kv_cache_init: layer 1: n_embd_k_gqa = 512, n_embd_v_gqa = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 2: n_embd_k_gqa = 512, n_embd_v_gqa = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 3: n_embd_k_gqa = 512, n_embd_v_gqa = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 4: n_embd_k_gqa = 512, n_embd_v_gqa = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 5: n_embd_k_gqa = 512, n_embd_v_gqa = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 6: n_embd_k_gqa = 512, n_embd_v_gqa = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 7: n_embd_k_gqa = 512, n_embd_v_gqa = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 8: n_embd_k_gqa = 512, n_embd_v_gqa = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 9: n_embd_k_gqa = 512, n_embd_v_gqa = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 10: n_embd_k_gqa = 512, n_embd_v_gqa = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 11: n_embd_k_gqa = 512, n_embd_v_gqa = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 12: n_embd_k_gqa = 512, n_embd_v_gqa = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 13: n_embd_k_gqa = 512, n_embd_v_gqa = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 14: n_embd_k_gqa = 512, n_embd_v_gqa = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 15: n_embd_k_gqa = 512, n_embd_v_gqa = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 16: n_embd_k_gqa = 512, n_embd_v_gqa = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 17: n_embd_k_gqa = 512, n_embd_v_gqa = 512 Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 18: n_embd_k_gqa = 512, 
n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 19: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 20: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 21: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 22: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 23: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 24: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 25: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 26: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: llama_kv_cache_init: layer 27: n_embd_k_gqa = 512, n_embd_v_gqa = 512
Mar 05 12:43:07 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:07.959+08:00 level=DEBUG source=server.go:602 msg="model load progress 1.00"
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: llama_kv_cache_init: CPU KV buffer size = 448.00 MiB
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: llama_new_context_with_model: KV self size = 448.00 MiB, K (f16): 224.00 MiB, V (f16): 224.00 MiB
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: llama_new_context_with_model: CPU output buffer size = 2.38 MiB
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: llama_new_context_with_model: CPU compute buffer size = 492.01 MiB
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: llama_new_context_with_model: graph nodes = 986
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: llama_new_context_with_model: graph splits = 1
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:08.209+08:00 level=INFO source=server.go:596 msg="llama runner started in 2.77 seconds"
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:08.210+08:00 level=DEBUG source=sched.go:463 msg="finished setting up runner" model=/usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:43:08 | 200 | 6.402278357s | 127.0.0.1 | POST "/api/generate"
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:08.210+08:00 level=DEBUG source=sched.go:467 msg="context for request finished"
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:08.210+08:00 level=DEBUG source=sched.go:340 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49 duration=5m0s
Mar 05 12:43:08 icrgpuserver1 ollama[27511]: time=2025-03-05T12:43:08.210+08:00 level=DEBUG source=sched.go:358 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49 refCount=0
Mar 05 12:44:01 icrgpuserver1 ollama[27511]: time=2025-03-05T12:44:01.515+08:00 level=DEBUG source=sched.go:576 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49
Mar 05 12:44:01 icrgpuserver1 ollama[27511]: time=2025-03-05T12:44:01.515+08:00 level=DEBUG source=routes.go:1480 msg="chat request" images=0 prompt="<|User|>how to become more lucky?<|Assistant|>"
Mar 05 12:44:01 icrgpuserver1 ollama[27511]: time=2025-03-05T12:44:01.531+08:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=9 used=0 remaining=9
Mar 05 12:44:45 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:44:45 | 200 | 42.038µs | 127.0.0.1 | HEAD "/"
Mar 05 12:44:45 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:44:45 | 200 | 150.666µs | 127.0.0.1 | GET "/api/ps"
Mar 05 12:52:38 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:52:38 | 200 | 61.208µs | 127.0.0.1 | GET "/api/version"
Mar 05 12:54:12 icrgpuserver1 ollama[27511]: time=2025-03-05T12:54:12.456+08:00 level=DEBUG source=sched.go:408 msg="context for request finished"
Mar 05 12:54:12 icrgpuserver1 ollama[27511]: time=2025-03-05T12:54:12.456+08:00 level=DEBUG source=sched.go:340 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49 duration=5m0s
Mar 05 12:54:12 icrgpuserver1 ollama[27511]: time=2025-03-05T12:54:12.456+08:00 level=DEBUG source=sched.go:358 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49 refCount=0
Mar 05 12:54:12 icrgpuserver1 ollama[27511]: [GIN] 2025/03/05 - 12:54:12 | 200 | 10m11s | 127.0.0.1 | POST "/api/chat"
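A quick way to tell from a log like the one above whether any layers landed on the GPU is to look at the `llama_kv_cache_init` lines: a `CPU KV buffer` with no `CUDA… KV buffer` means nothing was offloaded. A minimal sketch (the log path is hypothetical; on a systemd host the log might come from `journalctl -u ollama` instead):

```shell
# Sketch: check where the KV cache was allocated in an ollama server log.
# LOG path is an assumption; substitute your own log source.
LOG=/tmp/ollama-server.log

# For this sketch, reproduce the line seen in the report above:
printf 'llama_kv_cache_init: CPU KV buffer size = 448.00 MiB\n' > "$LOG"

if grep -q 'CUDA[0-9]* KV buffer' "$LOG"; then
  echo "KV cache on GPU: at least some layers were offloaded"
elif grep -q 'CPU KV buffer' "$LOG"; then
  echo "KV cache on CPU: the model is running entirely on the CPU"
fi
```

In the log above only `CPU KV buffer size = 448.00 MiB` appears, which matches the reported symptom that the model was loaded onto the CPU despite four V100s being detected.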
Reference: github-starred/ollama-ollama#6201