[GH-ISSUE #9541] llama_model_load_from_file_impl: failed to load model #6223

Closed
opened 2026-04-12 17:37:55 -05:00 by GiteaMirror · 15 comments

Originally created by @Nondirectional on GitHub (Mar 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9541

What is the issue?

Using the model shaw/dmeta-embedding-zh:latest to embed a query raises an error:

llama runner process has terminated: error loading model: llama_model_loader: failed to load model from C:\\Users\\Administrator\\.ollama\\models\\blobs\\sha256-26bd607a51eb1f3a0d3beac444b977e03fa745def499add60c996c08c8c2ddcd

I updated Ollama to version 0.5.13 yesterday; it was working fine on the previous version. The problem has occurred since the update to 0.5.13.

I can confirm that the blob sha256-26bd607a51eb1f3a0d3beac444b977e03fa745def499add60c996c08c8c2ddcd mentioned in the log actually exists.

Relevant log output

time=2025-03-06T15:52:54.665+08:00 level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-f4521a16-c6d1-f138-b009-e7c133f41253 library=cuda total="24.0 GiB" available="17.0 GiB"
time=2025-03-06T15:52:54.665+08:00 level=WARN source=ggml.go:136 msg="key not found" key=bert.attention.head_count_kv default=1
time=2025-03-06T15:52:54.665+08:00 level=WARN source=ggml.go:136 msg="key not found" key=bert.attention.key_length default=64
time=2025-03-06T15:52:54.665+08:00 level=WARN source=ggml.go:136 msg="key not found" key=bert.attention.value_length default=64
time=2025-03-06T15:52:54.665+08:00 level=WARN source=ggml.go:136 msg="key not found" key=bert.attention.head_count_kv default=1
time=2025-03-06T15:52:54.665+08:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Administrator\.ollama\models\blobs\sha256-26bd607a51eb1f3a0d3beac444b977e03fa745def499add60c996c08c8c2ddcd gpu=GPU-f4521a16-c6d1-f138-b009-e7c133f41253 parallel=1 available=18235318272 required="888.9 MiB"
time=2025-03-06T15:52:54.678+08:00 level=INFO source=server.go:97 msg="system memory" total="61.7 GiB" free="49.1 GiB" free_swap="42.4 GiB"
time=2025-03-06T15:52:54.678+08:00 level=WARN source=ggml.go:136 msg="key not found" key=bert.attention.head_count_kv default=1
time=2025-03-06T15:52:54.678+08:00 level=WARN source=ggml.go:136 msg="key not found" key=bert.attention.key_length default=64
time=2025-03-06T15:52:54.678+08:00 level=WARN source=ggml.go:136 msg="key not found" key=bert.attention.value_length default=64
time=2025-03-06T15:52:54.678+08:00 level=WARN source=ggml.go:136 msg="key not found" key=bert.attention.head_count_kv default=1
time=2025-03-06T15:52:54.678+08:00 level=INFO source=server.go:130 msg=offload library=cuda layers.requested=-1 layers.model=13 layers.offload=13 layers.split="" memory.available="[17.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="888.9 MiB" memory.required.partial="888.9 MiB" memory.required.kv="6.0 MiB" memory.required.allocations="[888.9 MiB]" memory.weights.total="330.5 MiB" memory.weights.repeating="268.6 MiB" memory.weights.nonrepeating="61.9 MiB" memory.graph.full="12.0 MiB" memory.graph.partial="12.0 MiB"
time=2025-03-06T15:52:54.679+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="C:\\Users\\Administrator\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Administrator\\.ollama\\models\\blobs\\sha256-26bd607a51eb1f3a0d3beac444b977e03fa745def499add60c996c08c8c2ddcd --ctx-size 2048 --batch-size 512 --n-gpu-layers 13 --threads 12 --no-mmap --parallel 1 --port 57570"
time=2025-03-06T15:52:54.683+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=2
time=2025-03-06T15:52:54.683+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
time=2025-03-06T15:52:54.683+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
time=2025-03-06T15:52:54.703+08:00 level=INFO source=runner.go:931 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090 D, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from C:\Users\Administrator\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\Administrator\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
time=2025-03-06T15:52:54.772+08:00 level=INFO source=runner.go:934 msg=system info="CPU : LLAMAFILE = 1 | CUDA : ARCHS = 500,600,610,700,750,800,860,870,890,900,1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | cgo(clang)" threads=12
time=2025-03-06T15:52:54.772+08:00 level=INFO source=runner.go:992 msg="Server listening on 127.0.0.1:57570"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090 D) - 23042 MiB free
gguf_init_from_file_impl: duplicate key 'tokenizer.ggml.bos_token_id' for tensors 12 and 23 
gguf_init_from_file_impl: failed to read key-value pairs
llama_model_load: error loading model: llama_model_loader: failed to load model from C:\Users\Administrator\.ollama\models\blobs\sha256-26bd607a51eb1f3a0d3beac444b977e03fa745def499add60c996c08c8c2ddcd

llama_model_load_from_file_impl: failed to load model
panic: unable to load model: C:\Users\Administrator\.ollama\models\blobs\sha256-26bd607a51eb1f3a0d3beac444b977e03fa745def499add60c996c08c8c2ddcd

goroutine 28 [running]:
github.com/ollama/ollama/runner/llamarunner.(*Server).loadModel(0xc0004e6000, {0xd, 0x0, 0x0, 0x0, {0x0, 0x0, 0x0}, 0xc000512250, 0x0}, ...)
	C:/a/ollama/ollama/runner/llamarunner/runner.go:851 +0x375
created by github.com/ollama/ollama/runner/llamarunner.Execute in goroutine 1
	C:/a/ollama/ollama/runner/llamarunner/runner.go:966 +0xcb7
time=2025-03-06T15:52:54.859+08:00 level=ERROR source=server.go:421 msg="llama runner terminated" error="exit status 2"
time=2025-03-06T15:52:54.933+08:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: error loading model: llama_model_loader: failed to load model from C:\\Users\\Administrator\\.ollama\\models\\blobs\\sha256-26bd607a51eb1f3a0d3beac444b977e03fa745def499add60c996c08c8c2ddcd"
[GIN] 2025/03/06 - 15:52:54 | 500 |    2.5998901s |     10.100.1.47 | POST     "/api/embed"
[GIN] 2025/03/06 - 15:52:55 | 200 |    598.4776ms |     10.100.1.47 | POST     "/api/chat"

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.5.13
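The log above points at a likely cause: gguf_init_from_file_impl reports a duplicate key 'tokenizer.ggml.bos_token_id' and then fails to read the key-value pairs, which suggests the model file's GGUF metadata contains the same key twice. A minimal sketch for checking a blob for repeated metadata keys follows. It assumes the little-endian GGUF v2/v3 header layout (magic, version, tensor count, key-value count, then the key-value pairs); the function names are hypothetical and not part of Ollama or llama.cpp.

```python
import struct

# Byte widths of the fixed-size GGUF metadata value types
# (uint8, int8, uint16, int16, uint32, int32, float32, bool, uint64, int64, float64).
_FIXED = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
_STRING, _ARRAY = 8, 9


def _read(f, n):
    """Read exactly n bytes or fail loudly on a truncated file."""
    data = f.read(n)
    if len(data) != n:
        raise ValueError("unexpected end of file")
    return data


def _u32(f):
    return struct.unpack("<I", _read(f, 4))[0]


def _u64(f):
    return struct.unpack("<Q", _read(f, 8))[0]


def _string(f):
    # A GGUF string is a uint64 length followed by that many bytes.
    return _read(f, _u64(f)).decode("utf-8", errors="replace")


def _skip_value(f, vtype):
    """Advance past one metadata value without interpreting it."""
    if vtype in _FIXED:
        _read(f, _FIXED[vtype])
    elif vtype == _STRING:
        _read(f, _u64(f))
    elif vtype == _ARRAY:
        elem_type = _u32(f)
        count = _u64(f)
        if elem_type in _FIXED:
            _read(f, count * _FIXED[elem_type])
        else:
            for _ in range(count):
                _skip_value(f, elem_type)
    else:
        raise ValueError(f"unknown GGUF value type {vtype}")


def duplicate_gguf_keys(f):
    """Return metadata keys that occur more than once, in order of first repeat.

    `f` is a binary file object positioned at the start of the file.
    """
    if _read(f, 4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _u32(f)
    if version < 2:
        raise ValueError("GGUF v1 uses 32-bit counts; not handled in this sketch")
    _u64(f)  # tensor count, not needed here
    kv_count = _u64(f)
    seen, dupes = set(), []
    for _ in range(kv_count):
        key = _string(f)
        _skip_value(f, _u32(f))
        if key in seen and key not in dupes:
            dupes.append(key)
        seen.add(key)
    return dupes
```

Opening the blob named in the log in binary mode and passing the file object to duplicate_gguf_keys should list 'tokenizer.ggml.bos_token_id' if the file is affected in the way the log describes; an empty list would suggest the cause lies elsewhere.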

GiteaMirror added the bug label 2026-04-12 17:37:55 -05:00

@jeffcrush commented on GitHub (Mar 6, 2025):

I have the same problem


@Nondirectional commented on GitHub (Mar 7, 2025):

> I have the same problem

For me, only shaw/dmeta-embedding-zh:latest has this problem; I am trying a different embedding model.

@ganjuesuifengzou commented on GitHub (Mar 9, 2025):

After updating to 0.5.13, loading shaw/dmeta-embedding-zh also fails for me. Is there a workaround? Or which embedding model would be a better choice?

@Nondirectional commented on GitHub (Mar 9, 2025):

> After updating to 0.5.13, loading shaw/dmeta-embedding-zh also fails for me. Is there a workaround? Or which embedding model would be a better choice?

I downloaded gte-Qwen2-1.5B (https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) from Hugging Face and deployed it to Ollama; you could give it a try.

@mengpeiwei commented on GitHub (Mar 13, 2025):

I have the same problem...

@Jiez2024 commented on GitHub (Mar 13, 2025):

same problem


@mengpeiwei commented on GitHub (Mar 13, 2025):

Rolling back to version 0.5.12 works...

@mbbyn commented on GitHub (Mar 13, 2025):

Same here; I wanted to share my system info and some details.
snowflake-arctic-embed2:latest worked fine for me on 0.6.0 after breaking on 0.5.13, but any larger model, even one that still fits in my 24 GB of VRAM, fails.

OS
Ubuntu 20.04 (WSL2)

GPU
Nvidia Quadro RTX 6000
Driver Version: 572.16
CUDA Version: 12.8

CPU
AMD

Ollama version
0.5.13 and 0.6.0
Works fine on 0.5.12


@Wu-Jianqiang commented on GitHub (Mar 16, 2025):

Failed: ollama pull shaw/dmeta-embedding-zh
OK: ollama pull shaw/dmeta-embedding-zh-q4

ollama version is 0.6.1


@mbbyn commented on GitHub (Mar 23, 2025):

It's working great for me with Ollama version 0.6.2


@qianzhouyi2 commented on GitHub (Mar 24, 2025):

shaw/dmeta-embedding-zh still fails.

@zmldndx commented on GitHub (Apr 7, 2025):

Hoping for a fix.

@phoenixlucky commented on GitHub (Apr 11, 2025):

0.6.5 has the same problem...

@wikty commented on GitHub (Apr 13, 2025):

I'm the maintainer of dmeta-embedding-zh. The compatibility issue with Ollama 0.6.x has been fixed; please re-download the model:

ollama rm shaw/dmeta-embedding-zh

ollama pull shaw/dmeta-embedding-zh
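
After re-pulling, one way to confirm that the model loads is to send a single request to the same /api/embed endpoint that returned 500 in the original log. The following is a minimal sketch, assuming the Ollama server is reachable at its default address http://localhost:11434; the helper names build_embed_request and embedding_length are illustrative only.

```python
import json
import urllib.request


def build_embed_request(model, text, host="http://localhost:11434"):
    """Build a POST request for Ollama's /api/embed endpoint."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{host}/api/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embedding_length(model, text, host="http://localhost:11434", timeout=60):
    """Return the dimension of the first embedding in the response.

    Raises urllib.error.HTTPError if the server answers with an error
    status, as in the 500 response shown in the original log.
    """
    request = build_embed_request(model, text, host)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return len(payload["embeddings"][0])
```

Calling embedding_length("shaw/dmeta-embedding-zh", "hello") against a running server should return a positive integer once the model loads; an HTTPError with status 500 would indicate the load is still failing.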

@phoenixlucky commented on GitHub (Apr 14, 2025):

> I'm the maintainer of dmeta-embedding-zh. The compatibility issue with Ollama 0.6.x has been fixed; please re-download the model:
>
> ollama rm shaw/dmeta-embedding-zh
>
> ollama pull shaw/dmeta-embedding-zh

Thank you, it has been added successfully.
Reference: github-starred/ollama#6223