[GH-ISSUE #4365] llava can't run #2725

Closed
opened 2026-04-12 13:02:30 -05:00 by GiteaMirror · 3 comments

Originally created by @Elminsst on GitHub (May 11, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4365

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Running `ollama run llava` fails.
![image](https://github.com/ollama/ollama/assets/130235860/9c500b70-8e56-4483-96da-a3ef505d0db0)
The server.log output is:
[GIN] 2024/05/11 - 23:10:27 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/05/11 - 23:10:27 | 200 | 1.0406ms | 127.0.0.1 | POST "/api/show"
[GIN] 2024/05/11 - 23:10:27 | 200 | 513.8µs | 127.0.0.1 | POST "/api/show"
time=2024-05-11T23:10:27.560+08:00 level=INFO source=memory.go:127 msg="offload to gpu" layers.real=-1 layers.estimate=33 memory.available="14.9 GiB" memory.required.full="5.3 GiB" memory.required.partial="5.3 GiB" memory.required.kv="256.0 MiB" memory.weights.total="3.9 GiB" memory.weights.repeating="3.8 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="181.0 MiB"
time=2024-05-11T23:10:27.565+08:00 level=INFO source=memory.go:127 msg="offload to gpu" layers.real=-1 layers.estimate=33 memory.available="14.9 GiB" memory.required.full="5.3 GiB" memory.required.partial="5.3 GiB" memory.required.kv="256.0 MiB" memory.weights.total="3.9 GiB" memory.weights.repeating="3.8 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="181.0 MiB"
time=2024-05-11T23:10:27.566+08:00 level=WARN source=server.go:207 msg="multimodal models don't support parallel requests yet"
time=2024-05-11T23:10:27.576+08:00 level=INFO source=server.go:318 msg="starting llama server" cmd="C:\Users\Elmin\AppData\Local\Programs\Ollama\ollama_runners\cuda_v11.3\ollama_llama_server.exe --model D:\AI\语言模型\models\Repository\blobs\sha256-170370233dd5c5415250a2ecd5c71586352850729062ccef1496385647293868 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --mmproj D:\AI\语言模型\models\Repository\blobs\sha256-72d6f08a42f656d36b356dbe0920675899a99ce21192fd66266fb7d82ed07539 --parallel 1 --port 14750"
time=2024-05-11T23:10:27.580+08:00 level=INFO source=sched.go:333 msg="loaded runners" count=1
time=2024-05-11T23:10:27.580+08:00 level=INFO source=server.go:488 msg="waiting for llama runner to start responding"
time=2024-05-11T23:10:27.581+08:00 level=INFO source=server.go:524 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=2770 commit="952d03d" tid="38720" timestamp=1715440227
INFO [wmain] system info | n_threads=10 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="38720" timestamp=1715440227 total_threads=20
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="14750" tid="38720" timestamp=1715440227
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
ERROR [load_model] unable to load clip model | model="D:\AI\语言模型\models\Repository\blobs\sha256-72d6f08a42f656d36b356dbe0920675899a99ce21192fd66266fb7d82ed07539" tid="38720" timestamp=1715440227
time=2024-05-11T23:10:27.832+08:00 level=ERROR source=sched.go:339 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 "
[GIN] 2024/05/11 - 23:10:27 | 500 | 588.9624ms | 127.0.0.1 | POST "/api/chat"
time=2024-05-11T23:10:32.924+08:00 level=WARN source=sched.go:507 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0920576
time=2024-05-11T23:10:33.174+08:00 level=WARN source=sched.go:507 msg="gpu VRAM usage didn't recover within timeout" seconds=5.3417755
time=2024-05-11T23:10:33.424+08:00 level=WARN source=sched.go:507 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5916134
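
Since Ollama stores every layer as a blob named after its SHA-256 digest, one quick sanity check is whether the mmproj file on disk still matches the digest in its own filename; a mismatch would point to a corrupted download rather than a loader bug. A minimal sketch in Python, reusing the blob path from the log above:

```python
# Recompute the SHA-256 of the mmproj blob and compare it against the digest
# embedded in its filename (Ollama names blobs sha256-<digest>).
import hashlib
from pathlib import Path

blob = Path(r"D:\AI\语言模型\models\Repository\blobs"
            r"\sha256-72d6f08a42f656d36b356dbe0920675899a99ce21192fd66266fb7d82ed07539")

h = hashlib.sha256()
with blob.open("rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)

expected = blob.name.split("-", 1)[1]
print("blob intact" if h.hexdigest() == expected else f"digest mismatch: {h.hexdigest()}")
```

If the digest matches, the file survived the download intact and the failure is in opening or parsing it.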

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.1.36

GiteaMirror added the bug and windows labels 2026-04-12 13:02:30 -05:00

@igorschlum commented on GitHub (May 12, 2024):

Hi @Elminsst, llava is working on macOS. Can you try with version 0.1.37?


@Elminsst commented on GitHub (May 13, 2024):

I have tried, but it is still the same issue.
![image](https://github.com/ollama/ollama/assets/130235860/9816a734-531e-4f9d-bb42-6b235368d1f9)

[GIN] 2024/05/13 - 09:18:56 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/05/13 - 09:18:56 | 200 | 11.1365ms | 127.0.0.1 | POST "/api/show"
[GIN] 2024/05/13 - 09:18:56 | 200 | 1.0432ms | 127.0.0.1 | POST "/api/show"
time=2024-05-13T09:18:56.992+08:00 level=INFO source=memory.go:127 msg="offload to gpu" layers.real=-1 layers.estimate=41 memory.available="14.9 GiB" memory.required.full="9.7 GiB" memory.required.partial="9.7 GiB" memory.required.kv="1.6 GiB" memory.weights.total="6.9 GiB" memory.weights.repeating="6.8 GiB" memory.weights.nonrepeating="128.2 MiB" memory.graph.full="204.0 MiB" memory.graph.partial="244.1 MiB"
time=2024-05-13T09:18:57.012+08:00 level=INFO source=memory.go:127 msg="offload to gpu" layers.real=-1 layers.estimate=41 memory.available="14.9 GiB" memory.required.full="9.7 GiB" memory.required.partial="9.7 GiB" memory.required.kv="1.6 GiB" memory.weights.total="6.9 GiB" memory.weights.repeating="6.8 GiB" memory.weights.nonrepeating="128.2 MiB" memory.graph.full="204.0 MiB" memory.graph.partial="244.1 MiB"
time=2024-05-13T09:18:57.012+08:00 level=WARN source=server.go:207 msg="multimodal models don't support parallel requests yet"
time=2024-05-13T09:18:57.055+08:00 level=INFO source=server.go:318 msg="starting llama server" cmd="C:\Users\Elmin\AppData\Local\Programs\Ollama\ollama_runners\cuda_v11.3\ollama_llama_server.exe --model D:\AI\语言模型\models\Repository\blobs\sha256-87d5b13e5157d3a67f8e10a46d8a846ec2b68c1f731e3dfe1546a585432b8fa0 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --mmproj D:\AI\语言模型\models\Repository\blobs\sha256-42037f9f4c1b801eebaec1545ed144b8b0fa8259672158fb69c8c68f02cfe00c --parallel 1 --port 7839"
time=2024-05-13T09:18:57.066+08:00 level=INFO source=sched.go:333 msg="loaded runners" count=1
time=2024-05-13T09:18:57.066+08:00 level=INFO source=server.go:488 msg="waiting for llama runner to start responding"
time=2024-05-13T09:18:57.068+08:00 level=INFO source=server.go:524 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=2770 commit="952d03d" tid="28724" timestamp=1715563137
INFO [wmain] system info | n_threads=10 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="28724" timestamp=1715563137 total_threads=20
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="7839" tid="28724" timestamp=1715563137
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
ERROR [load_model] unable to load clip model | model="D:\AI\语言模型\models\Repository\blobs\sha256-42037f9f4c1b801eebaec1545ed144b8b0fa8259672158fb69c8c68f02cfe00c" tid="28724" timestamp=1715563137
time=2024-05-13T09:18:57.641+08:00 level=ERROR source=sched.go:339 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 "
[GIN] 2024/05/13 - 09:18:57 | 500 | 1.5742949s | 127.0.0.1 | POST "/api/chat"
time=2024-05-13T09:19:02.807+08:00 level=WARN source=sched.go:507 msg="gpu VRAM usage didn't recover within timeout" seconds=5.1662627
time=2024-05-13T09:19:03.053+08:00 level=WARN source=sched.go:507 msg="gpu VRAM usage didn't recover within timeout" seconds=5.4122445
time=2024-05-13T09:19:03.314+08:00 level=WARN source=sched.go:507 msg="gpu VRAM usage didn't recover within timeout" seconds=5.6727232
[GIN] 2024/05/13 - 09:19:11 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/05/13 - 09:19:11 | 200 | 18.203ms | 127.0.0.1 | POST "/api/show"
[GIN] 2024/05/13 - 09:19:11 | 200 | 538µs | 127.0.0.1 | POST "/api/show"
time=2024-05-13T09:19:13.016+08:00 level=INFO source=memory.go:127 msg="offload to gpu" layers.real=-1 layers.estimate=33 memory.available="14.9 GiB" memory.required.full="5.7 GiB" memory.required.partial="5.7 GiB" memory.required.kv="512.0 MiB" memory.weights.total="3.9 GiB" memory.weights.repeating="3.8 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="296.0 MiB" memory.graph.partial="305.0 MiB"
time=2024-05-13T09:19:13.048+08:00 level=INFO source=memory.go:127 msg="offload to gpu" layers.real=-1 layers.estimate=33 memory.available="14.9 GiB" memory.required.full="5.7 GiB" memory.required.partial="5.7 GiB" memory.required.kv="512.0 MiB" memory.weights.total="3.9 GiB" memory.weights.repeating="3.8 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="296.0 MiB" memory.graph.partial="305.0 MiB"
time=2024-05-13T09:19:13.050+08:00 level=WARN source=server.go:207 msg="multimodal models don't support parallel requests yet"
time=2024-05-13T09:19:13.093+08:00 level=INFO source=server.go:318 msg="starting llama server" cmd="C:\Users\Elmin\AppData\Local\Programs\Ollama\ollama_runners\cuda_v11.3\ollama_llama_server.exe --model D:\AI\语言模型\models\Repository\blobs\sha256-deb26e54ccebc9ce0665bf35dfdc73f2989b7308ecf9b08bd897a9a1ec9cb455 --ctx-size 4096 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --mmproj D:\AI\语言模型\models\Repository\blobs\sha256-addb9fdda3a5a9ffe3376f97583a2ea160a1050f1393ba4d45fa4a3e6c884867 --parallel 1 --port 7908"
time=2024-05-13T09:19:13.106+08:00 level=INFO source=sched.go:333 msg="loaded runners" count=1
time=2024-05-13T09:19:13.106+08:00 level=INFO source=server.go:488 msg="waiting for llama runner to start responding"
time=2024-05-13T09:19:13.107+08:00 level=INFO source=server.go:524 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=2770 commit="952d03d" tid="38488" timestamp=1715563153
INFO [wmain] system info | n_threads=10 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="38488" timestamp=1715563153 total_threads=20
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="7908" tid="38488" timestamp=1715563153
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
ERROR [load_model] unable to load clip model | model="D:\AI\语言模型\models\Repository\blobs\sha256-addb9fdda3a5a9ffe3376f97583a2ea160a1050f1393ba4d45fa4a3e6c884867" tid="38488" timestamp=1715563153
time=2024-05-13T09:19:13.372+08:00 level=ERROR source=sched.go:339 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 "
[GIN] 2024/05/13 - 09:19:13 | 500 | 1.531562s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/05/13 - 09:19:16 | 200 | 0s | 127.0.0.1 | GET "/api/version"
time=2024-05-13T09:19:18.482+08:00 level=WARN source=sched.go:507 msg="gpu VRAM usage didn't recover within timeout" seconds=5.1094424
time=2024-05-13T09:19:18.729+08:00 level=WARN source=sched.go:507 msg="gpu VRAM usage didn't recover within timeout" seconds=5.3568757
time=2024-05-13T09:19:18.991+08:00 level=WARN source=sched.go:507 msg="gpu VRAM usage didn't recover within timeout" seconds=5.6188797
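
Both runs fail at the same point: the text model loads, but load_model cannot open the CLIP/mmproj blob before the runner aborts with 0xc0000409. As a further integrity check, the blob's GGUF header can be read directly; GGUF files begin with the 4-byte magic `GGUF` followed by a little-endian uint32 version. A minimal sketch, assuming the mmproj path from this log:

```python
# Read the first 8 bytes of the mmproj blob: GGUF magic plus format version.
import struct
from pathlib import Path

blob = Path(r"D:\AI\语言模型\models\Repository\blobs"
            r"\sha256-42037f9f4c1b801eebaec1545ed144b8b0fa8259672158fb69c8c68f02cfe00c")

with blob.open("rb") as f:
    magic, version = struct.unpack("<4sI", f.read(8))

print(magic, version)  # expect b'GGUF' and a small version number such as 2 or 3
```

If both the digest and the header check out, the file itself is fine and the problem lies in how the runner opens it.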


@Elminsst commented on GitHub (May 13, 2024):

Other models run well.
![image](https://github.com/ollama/ollama/assets/130235860/463d692f-615f-4427-a78f-4de1eed076ad)
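
Worth noting: only the multimodal projector fails while text-only models load from the same store, and every failing path contains non-ASCII characters (语言模型). A hedged hypothesis, not confirmed anywhere in this thread: the CLIP loader may open the file through an ANSI file API on Windows, which cannot represent such paths. A cheap experiment is to relocate the model store to an ASCII-only directory via the OLLAMA_MODELS environment variable; the sketch below merely flags whether the current path would be affected:

```python
# Flag non-ASCII characters in the configured Ollama models directory.
# OLLAMA_MODELS is Ollama's documented variable for relocating the model store;
# the fallback path here is the one seen in the logs above.
import os

models_dir = os.environ.get("OLLAMA_MODELS", r"D:\AI\语言模型\models\Repository")
non_ascii = sorted({ch for ch in models_dir if ord(ch) > 127})
print("ASCII-only path" if not non_ascii else f"non-ASCII characters: {non_ascii}")
```

If the path is flagged, setting OLLAMA_MODELS to an ASCII-only location such as D:\AI\models, restarting Ollama, and re-pulling llava would confirm or rule this hypothesis out.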

Reference: github-starred/ollama#2725