[GH-ISSUE #12897] ollama run modelscope.cn/unsloth/Qwen3-VL-8B-Instruct-GGUF failed #70606

Open
opened 2026-05-04 22:14:25 -05:00 by GiteaMirror · 6 comments
Owner

Originally created by @jiaolongxue on GitHub (Nov 1, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12897

What is the issue?

print_info: file size = 4.68 GiB (4.90 BPW)
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen3vl'
llama_model_load_from_file_impl: failed to load model
time=2025-11-01T00:53:20.257Z level=INFO source=sched.go:418 msg="NewLlamaServer failed" model=/root/.ollama/models/blobs/sha256-108e7ff92b78eefd3db4741885104acba514255c11b617d3c7b197a5f46efe89 error="unable to load model: /root/.ollama/models/blobs/sha256-108e7ff92b78eefd3db4741885104acba514255c11b617d3c7b197a5f46efe89"
[GIN] 2025/11/01 - 00:53:20 | 500 | 889.851487ms | 127.0.0.1 | POST "/api/generate"

root@4a9887ae8864:/# ollama -v
ollama version is 0.12.8
root@4a9887ae8864:/#
root@4a9887ae8864:/#
root@4a9887ae8864:/#

Relevant log output


OS

Linux, Docker

GPU

Nvidia

CPU

Intel

Ollama version

0.12.8

GiteaMirror added the bug label 2026-05-04 22:14:25 -05:00
@rick-github commented on GitHub (Nov 1, 2025):

qwen3vl is supported in the ollama engine. The unsloth model is a split vision model, which is not supported in the ollama engine, so the ollama server falls back to the llama.cpp engine, which doesn't support qwen3vl yet.
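One way to confirm which engine the server actually picked is to look for the runner command line in the server logs (via `journalctl -u ollama` or `docker logs`, depending on the setup): the new engine is started with the `--ollama-engine` flag. A minimal, self-contained sketch using a log line copied from later in this thread:

```shell
# Sample "starting runner" line taken from the logs posted in this issue.
line='time=2025-11-05T20:23:02.980+03:00 level=INFO source=server.go:400 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 46105"'

# Count occurrences of the flag: 1 means the new Ollama engine handled the
# load; 0 means the scheduler fell back to the llama.cpp runner.
echo "$line" | grep -c -- '--ollama-engine'   # → 1
```

Against a live server, the same grep can be run over the real log stream instead of the sample variable.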

@MakksSh commented on GitHub (Nov 5, 2025):

I'm experiencing the same error while using the official Qwen model from HF (https://huggingface.co/Qwen/Qwen3-VL-4B-Thinking-GGUF).

ollama[499080]: llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen3vl'

root@... ~ # ollama -v
ollama version is 0.12.9

@MakksSh commented on GitHub (Nov 5, 2025):

Nov 05 20:22:59 llm-server ollama[499080]: [GIN] 2025/11/05 - 20:22:59 | 200 | 733.917µs | 172.18.0.5 | POST "/api/generate"
Nov 05 20:22:59 llm-server ollama[499080]: ggml_backend_cuda_device_get_memory device GPU-abd731bb-dea4-4514-ca3a-6e0f36901746 utilizing NVML memory reporting free: 23752998912 total: 34190917632
Nov 05 20:23:02 llm-server ollama[499080]: time=2025-11-05T20:23:02.980+03:00 level=INFO source=server.go:400 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 46105"
Nov 05 20:23:05 llm-server ollama[499080]: time=2025-11-05T20:23:05.987+03:00 level=INFO source=server.go:400 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 35927"
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: loaded meta data with 30 key-value pairs and 398 tensors from /llm/ollama_models/blobs/sha256-b0af534e2a0a90d53886c46129de279f45cdd82ca7e513141c9e3a4e14f36d93 (version GGUF V3 (latest))
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 0: general.architecture str = qwen3vl
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 1: general.type str = model
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 2: general.name str = Qwen3Vl 4b Thinking
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 3: general.finetune str = thinking
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 4: general.basename str = qwen3vl
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 5: general.size_label str = 4B
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 6: qwen3vl.block_count u32 = 36
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 7: qwen3vl.context_length u32 = 262144
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 8: qwen3vl.embedding_length u32 = 2560
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 9: qwen3vl.feed_forward_length u32 = 9728
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 10: qwen3vl.attention.head_count u32 = 32
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 11: qwen3vl.attention.head_count_kv u32 = 8
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 12: qwen3vl.rope.freq_base f32 = 5000000.000000
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 13: qwen3vl.attention.layer_norm_rms_epsilon f32 = 0.000001
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 14: qwen3vl.attention.key_length u32 = 128
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 15: qwen3vl.attention.value_length u32 = 128
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 16: general.file_type u32 = 1
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 17: qwen3vl.rope.dimension_sections arr[i32,4] = [24, 20, 20, 0]
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 18: qwen3vl.n_deepstack_layers u32 = 3
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 19: general.quantization_version u32 = 2
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151643
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 151643
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - kv 29: tokenizer.chat_template str = {%- set image_count = namespace(value...
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - type f32: 145 tensors
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_loader: - type f16: 253 tensors
Nov 05 20:23:07 llm-server ollama[499080]: print_info: file format = GGUF V3 (latest)
Nov 05 20:23:07 llm-server ollama[499080]: print_info: file type = F16
Nov 05 20:23:07 llm-server ollama[499080]: print_info: file size = 7.49 GiB (16.00 BPW)
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen3vl'
Nov 05 20:23:07 llm-server ollama[499080]: llama_model_load_from_file_impl: failed to load model
Nov 05 20:23:07 llm-server ollama[499080]: time=2025-11-05T20:23:07.245+03:00 level=INFO source=sched.go:418 msg="NewLlamaServer failed" model=/llm/ollama_models/blobs/sha256-b0af534e2a0a90d53886c46129de279f45cdd82ca7e513141c9e3a4e14f36d93 error="unable to load model: /llm/ollama_models/blobs/sha256-b0af534e2a0a90d53886c46129de279f45cdd82ca7e513141c9e3a4e14f36d93"
Nov 05 20:23:07 llm-server ollama[499080]: [GIN] 2025/11/05 - 20:23:07 | 500 | 1.302249967s | 172.18.0.5 | POST "/api/chat"

@MakksSh commented on GitHub (Nov 5, 2025):

> qwen3vl is supported in the ollama engine. The unsloth model is a split vision model which is not supported in the ollama engine so the ollama server falls back to the llama.cpp engine, which doesn't support qwen3vl yet.

llama.cpp has already added support for qwen3vl models in its engine. Here is the PR: https://github.com/ggml-org/llama.cpp/pull/16780

@rick-github commented on GitHub (Nov 5, 2025):

It will be integrated in the next vendor sync.

@EnlistedGhost commented on GitHub (Nov 18, 2025):

@MakksSh @jiaolongxue

Just as a note:

GGUF files for Qwen3-VL that are created directly with llama.cpp WILL NOT WORK when Ollama attempts to load them!

You must create/quantize the Qwen3-VL model via Ollama itself.

Please refer to the Ollama documentation on [importing a model from Safetensors weights](https://docs.ollama.com/import#importing-a-model-from-safetensors-weights).
If you import the Qwen3-VL safetensors using Ollama's own method, it works!
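The import flow described above can be sketched as follows. This is a hedged sketch based on the linked Ollama docs, not a command sequence from this issue; the weights path is a placeholder for wherever the Qwen3-VL safetensors directory actually lives:

```shell
# Write a minimal Modelfile pointing at a directory of safetensors weights
# (placeholder path; substitute the real Qwen3-VL checkout).
cat > Modelfile <<'EOF'
FROM /path/to/Qwen3-VL-8B-Instruct
EOF

# Then let Ollama itself convert (and optionally quantize) the model:
#   ollama create qwen3-vl -f Modelfile
#   ollama run qwen3-vl
```

Because Ollama performs the conversion, the resulting model is loaded by the ollama engine rather than the llama.cpp fallback.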


Reference: github-starred/ollama#70606