Ollama model files for Gemma3 specifying mmproj ggufs do not retain vision capability. #6534

Closed
opened 2025-11-12 13:37:01 -06:00 by GiteaMirror · 8 comments

Originally created by @lkraven on GitHub (Mar 24, 2025).

What is the issue?

When creating an Ollama Modelfile with two FROM statements, one for the primary model and one for the projector model, such as:

`ollama create -f gemma3-i-4-gguf gemma3:4b_Q6_K`

```
FROM /Storage/bartowski_google_gemma-3-4b-it-GGUF/google_gemma-3-4b-it-Q6_K.gguf
FROM /Storage/bartowski_google_gemma-3-4b-it-GGUF/mmproj-google_gemma-3-4b-it-f32.gguf
TEMPLATE """{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 }}
{{- if or (eq .Role "user") (eq .Role "system") }}<start_of_turn>user
{{ .Content }}<end_of_turn>
{{ if $last }}<start_of_turn>model
{{ end }}
{{- else if eq .Role "assistant" }}<start_of_turn>model
{{ .Content }}{{ if not $last }}<end_of_turn>
{{ end }}
{{- end }}
{{- end }}"""
PARAMETER temperature 1.0
PARAMETER top_k 64
PARAMETER top_p 0.95
PARAMETER min_p 0.0
PARAMETER repeat_penalty 1.0
PARAMETER stop <end_of_turn>
```

Even though Ollama shows the CLIP file:

`ollama show gemma3:4b_Q6_K`

```
  Model
    architecture        gemma3
    parameters          3.9B
    context length      131072
    embedding length    2560
    quantization        unknown

  Projector
    architecture        clip
    parameters          419.82M
    embedding length    1152
    dimensions          2560

  Parameters
    repeat_penalty    1
    stop              "<end_of_turn>"
    temperature       1
    top_k             64
    top_p             0.95
    min_p             0
```

When trying to pass an image, this is what you get:

```
Mar 24 13:13:05 ana-ml1 ollama[3565562]: time=2025-03-24T13:13:05.939-07:00 level=INFO source=server.go:766 msg="llm predict error: Failed to create new sequence: failed to process inputs: this model is missing data required for image input"
```

Is this the correct way to add an mmproj to a quantized model?
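
(Editor's note: a quick way to check whether an imported model actually accepts images is to send one through the chat API. This is a generic sketch, not from the thread; the base64 payload is a placeholder.)

```sh
# Ollama's /api/chat accepts base64-encoded images in a message's "images" array.
curl http://localhost:11434/api/chat -d '{
  "model": "gemma3:4b_Q6_K",
  "messages": [
    {
      "role": "user",
      "content": "Describe this image.",
      "images": ["<base64-encoded image data>"]
    }
  ]
}'
```

If the projector was not actually wired up, this is where the "missing data required for image input" error surfaces.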

Relevant log output


OS

Debian

GPU

A6000

CPU

No response

Ollama version

0.6.2


@pdevine commented on GitHub (Mar 25, 2025):

Hey @lkraven, see my note [here](https://github.com/ollama/ollama/issues/9762#issuecomment-2744470349) on how to quantize to `Q6_K`.

Bartowski's quantization doesn't include the vision tower inside of the same GGUF so you will need to figure out how to combine the two models if you want to use the bits that you downloaded. The easiest way is to just pull the non-quantized model from HF and convert it yourself using the directions I outlined above.
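
(Editor's note: the workflow described above would look roughly like the sketch below. The HF repo name and local paths are assumptions; `ollama create` does support a `--quantize` flag for converting full-precision weights.)

```sh
# Pull the original (unquantized) safetensors weights from Hugging Face
# (requires git-lfs for the weight files).
git clone https://huggingface.co/google/gemma-3-4b-it

# Modelfile whose FROM points at the safetensors directory:
#   FROM ./gemma-3-4b-it

# Convert and quantize in one step, keeping the vision tower intact.
ollama create gemma3:4b_Q6_K -f Modelfile --quantize q6_K
```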


@lkraven commented on GitHub (Mar 25, 2025):

Interesting, a lot of the fine-tunes do not contain the vision tower even before quantization. Llama.cpp and Kobold.cpp will still allow the multi-modal projector model extracted from the base model to be used in conjunction with these tunes to retain (or inherit?) the vision capabilities.

I have been able to quantize other fine tunes that retain the vision tower and use their vision capabilities in ollama, but have not been able to add vision capabilities back as I have with the other backends. The documentation seems to suggest I can do it by adding multiple ggufs, but perhaps I'm wrong.

If this is just the way it is, and it's not a bug, then I stand corrected. Thanks for your assistance!
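
(Editor's note: the llama.cpp behavior described above, pairing a separate projector GGUF with the language model, looks roughly like the sketch below. Binary names have changed across llama.cpp releases, with older builds shipping `llama-llava-cli`, so treat this as illustrative.)

```sh
# Load the language GGUF and the multimodal projector GGUF together.
./llama-mtmd-cli \
  -m google_gemma-3-4b-it-Q6_K.gguf \
  --mmproj mmproj-google_gemma-3-4b-it-f32.gguf \
  --image photo.jpg \
  -p "Describe this image."
```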


@pdevine commented on GitHub (Mar 25, 2025):

@lkraven this is the first model that we've released that uses ollama's engine and not llama.cpp under the covers, so it's not using the same implementation as llama.cpp or kobold.cpp. We did use the method of combining GGUFs like that in the past (for llama3.2 vision for instance) but are moving away from that model and closer to how safetensors works. I definitely recommend just pulling the safetensors weights and doing the conversion.


@lkraven commented on GitHub (Mar 25, 2025):

That's fine-- thanks for the information. Some of the fine tunes don't bring the vision tower with them, so it's not possible to run them with vision. But that is also a choice that the people doing the fine-tuning are making. Closing this since it's not a bug.


@sammcj commented on GitHub (Jun 22, 2025):

In case anyone stumbles across this like I did, the correct way to do this in Ollama is to place both the main model GGUF and the mmproj gguf in the same directory and provide the directory path in the FROM directive. PR submitted to Ollama to clarify this in the docs: https://github.com/ollama/ollama/pull/11163/files
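
(Editor's note: a minimal sketch of that layout; directory and file names here are illustrative.)

```sh
# Put both GGUFs in one directory:
#   /models/gemma3-4b/google_gemma-3-4b-it-Q6_K.gguf
#   /models/gemma3-4b/mmproj-google_gemma-3-4b-it-f32.gguf

# Modelfile pointing FROM at the directory, not a single file:
#   FROM /models/gemma3-4b

ollama create gemma3:4b_Q6_K -f Modelfile
```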


@arbv commented on GitHub (Jul 6, 2025):

@sammcj Does not seem to work for Gemma 3. For some reason the vision qualities are not retained for me, despite the fact that ollama imports the file.


@alisson-anjos commented on GitHub (Jul 6, 2025):

> @sammcj Does not seem to work for Gemma 3. For some reason the vision qualities are not retained for me, despite the fact that ollama imports the file.

I have the same problem :(


@Ka0Ri commented on GitHub (Jul 16, 2025):

I used llama.cpp to build *.gguf files from Gemma 3 (LoRA fine-tuned locally) with the command `python convert_hf_to_gguf.py [model_folder] --outtype f16`, plus the `--mmproj` flag for the vision part. Then I used a Modelfile to convert it into an Ollama model, but Ollama only worked for the language part and produced the same error as above. However, with Ollama version `0.9.6`, I can enable vision capability when building the model directly from the safetensors weights, as mentioned [here](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#build-from-a-safetensors-model). This is my Modelfile, taken from the official gemma3.

```
FROM [path_to_safetensors_weights]
TEMPLATE """{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 }}
{{- if or (eq .Role "user") (eq .Role "system") }}<start_of_turn>user
{{ .Content }}<end_of_turn>
{{ if $last }}<start_of_turn>model
{{ end }}
{{- else if eq .Role "assistant" }}<start_of_turn>model
{{ .Content }}{{ if not $last }}<end_of_turn>
{{ end }}
{{- end }}
{{- end }}"""
PARAMETER stop <end_of_turn>
PARAMETER temperature 1
PARAMETER top_k 64
PARAMETER top_p 0.95
SYSTEM """You are a Vision Language Model specialized in interpreting visual data from crop images. Your task is to analyze the provided image and respond to queries with concise answers, usually a short phrase about crop type or feature of crop. Focus on delivering accurate, succinct answers based on the visual information. Avoid additional explanation unless absolutely necessary."""
```