[GH-ISSUE #13101] ollama run qwen3-vl-4b-instruct error #70731

Closed
opened 2026-05-04 22:46:51 -05:00 by GiteaMirror · 7 comments

Originally created by @yuqikong on GitHub (Nov 15, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13101

I fine-tuned the qwen3-vl-4b-instruct model using LoRA, then quantized it with llama.cpp to generate a .gguf file. However, Ollama cannot run it and throws an error.
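(For reference, the usual llama.cpp conversion flow for a model like this is sketched below; exact script and flag names vary between llama.cpp versions, and the model paths are placeholders.)

# Convert the merged HF checkpoint to GGUF (text weights).
python convert_hf_to_gguf.py ./Qwen3VL-4B-Instruct --outfile Qwen3VL-4B-Instruct-F16.gguf

# Export the vision projector as a separate mmproj file.
python convert_hf_to_gguf.py ./Qwen3VL-4B-Instruct --mmproj --outfile mmproj-Qwen3VL-4B-Instruct-F16.gguf

# Optionally quantize the text weights, e.g. to 8-bit.
llama-quantize Qwen3VL-4B-Instruct-F16.gguf Qwen3VL-4B-Instruct-Q8_0.gguf Q8_0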
This is my Modelfile:

FROM ./Qwen3VL-4B-Instruct-F16.gguf
FROM ./mmproj-Qwen3VL-4B-Instruct-F16.gguf

This is the command I used to create the ollama model:

ollama create qwen3-vl-4b -f Modelfile

This is the command I used to run it:
ollama run qwen3-vl-4b

Then the following error was encountered:

Error: 500 Internal Server Error: unable to load model: C:\Users\yqkong2\.ollama\models\blobs\sha256-4a6fb1ed053e162887bfc372bd7ffaaa9b8e4049ac6e9b9398ae7277162ec761


@rick-github commented on GitHub (Nov 15, 2025):

[Server log](https://docs.ollama.com/troubleshooting) may help in debugging. Does the Modelfile have a template?
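(Per the linked troubleshooting docs, on Windows the server log lives under %LOCALAPPDATA%\Ollama, for example:)

# Open the Ollama log directory on Windows; server.log records why the model failed to load.
explorer %LOCALAPPDATA%\Ollama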


@yuqikong commented on GitHub (Nov 15, 2025):

> [Server log](https://docs.ollama.com/troubleshooting) may help in debugging. Does the Modelfile have a template?

The Modelfile doesn't contain a template; it only contains `FROM ./Qwen3VL-4B-Instruct-F16.gguf` and `FROM ./mmproj-Qwen3VL-4B-Instruct-F16.gguf`. Is this the main reason? How should the template be written for `qwen3-vl-4b-instruct`?
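(A sketch for illustration: Qwen-family models use a ChatML-style prompt format, so a minimal TEMPLATE block could look like the following. This is an assumption based on the generic Qwen chat format, not the official qwen3-vl template, which is considerably more elaborate.)

# Minimal ChatML-style template sketch (assumed, not the official one).
TEMPLATE """{{- range .Messages }}<|im_start|>{{ .Role }}
{{ .Content }}<|im_end|>
{{ end }}<|im_start|>assistant
"""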


@rick-github commented on GitHub (Nov 15, 2025):

I think the problem here is that you've fine-tuned a model (qwen3-vl) that only runs on the ollama engine, but you are trying to load it as a split model (separate files for text and vision weights), which is not supported on the ollama engine. Because it's not supported, ollama tries to fall back to the llama.cpp engine, but because the vendor sync hasn't moved past [b6887](https://github.com/ggml-org/llama.cpp/releases/tag/b6887) yet, the llama.cpp engine doesn't support it either. If you have the safetensors format of your fine-tuned model, you can [import](https://github.com/ollama/ollama/blob/main/docs/import.mdx#importing-a-model-from-safetensors-weights) that to get it to work with the ollama engine.
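(A minimal sketch of that import flow, assuming the merged fine-tuned weights sit in a local safetensors directory; the directory name here is a placeholder.)

# Modelfile: point FROM at the safetensors directory instead of a GGUF file.
FROM ./Qwen3VL-4B-Instruct-finetuned

# Then create and run as before.
ollama create qwen3-vl-4b -f Modelfile
ollama run qwen3-vl-4b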


@yuqikong commented on GitHub (Nov 16, 2025):

> I think the problem here is that you've fine-tuned a model (qwen3-vl) that only runs on the ollama engine, but you are trying to load it as a split model (separate files for text and vision weights), which is not supported on the ollama engine. Because it's not supported, ollama tries to fall back to the llama.cpp engine, but because the vendor sync hasn't moved past [b6887](https://github.com/ggml-org/llama.cpp/releases/tag/b6887) yet, the llama.cpp engine doesn't support it either. If you have the safetensors format of your fine-tuned model, you can [import](https://github.com/ollama/ollama/blob/main/docs/import.mdx#importing-a-model-from-safetensors-weights) that to get it to work with the ollama engine.

So if I want to quantize the fine-tuned qwen3-vl-4b-instruct to GGUF and then run it in ollama, is there a way to do that? I want to quantize it to 8 bits before running it in ollama. I saw that someone uploaded an ollama version of qwen3-vl on the official site, and judging from the file size it should be quantized; after pulling it with ollama, it runs. How is this done?
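(One possible answer, sketched under the assumption that the safetensors import above works: ollama create can quantize FP16/FP32 weights at import time with the --quantize flag, q8_0 being one of the supported types.)

# Import the safetensors weights and quantize to 8-bit in one step.
ollama create qwen3-vl-4b-q8 --quantize q8_0 -f Modelfile
ollama run qwen3-vl-4b-q8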


@rick-github commented on GitHub (Nov 16, 2025):

> If you have the safetensors format of your fine-tuned model, you can [import](https://github.com/ollama/ollama/blob/main/docs/import.mdx#importing-a-model-from-safetensors-weights) that to get it to work with the ollama engine.


@yuqikong commented on GitHub (Nov 16, 2025):

> > If you have the safetensors format of your fine-tuned model, you can [import](https://github.com/ollama/ollama/blob/main/docs/import.mdx#importing-a-model-from-safetensors-weights) that to get it to work with the ollama engine.

Thanks!


@bingo1991 commented on GitHub (Dec 19, 2025):

ollama run qwen3-vl:4b
Error: 500 Internal Server Error: model requires more system memory (38.3 GiB) than is available (34.6 GiB)
