[GH-ISSUE #12321] Gemma 3n not working #54699

Closed
opened 2026-04-29 06:59:14 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @Jeremy-Developer-Page on GitHub (Sep 17, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12321

What is the issue?

If you take the latest version of gemma3n E4B from Hugging Face and run it on Ollama (convert the model, then `ollama run model`), you will see that the multimodal part doesn't work on Ollama.

Starting with 0.11.0, I don't know what changed, but all multimodal models from Hugging Face (e.g. the standard Gemma 3 12B) converted with Ollama lose their multimodal part. If you use 0.10.1 instead, you can correctly convert multimodal models (like the standard Gemma 3 12B), and Gemma 3n too, BUT Gemma 3n still ends up text-only even though the original model supports images and audio as well!

So why doesn't Ollama support that? And why, since 0.11, can I no longer convert multimodal models from Hugging Face correctly, so that every time I have to uninstall the latest release, install 0.10.1, convert, and then upgrade back to the latest? It's absurd! Please fix these two things.
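For reference, the documented way to import a Hugging Face safetensors checkpoint into Ollama is `ollama create` with a Modelfile whose `FROM` points at the downloaded model directory (there is no `ollama convert` subcommand). A minimal sketch, assuming the repo has already been cloned locally — the directory name and model tag below are placeholders, not taken from this issue:

```
# Modelfile — FROM points at the local safetensors directory
FROM ./gemma-3n-E4B-it
```

Then `ollama create gemma3n-e4b -f Modelfile` builds the model and `ollama run gemma3n-e4b` runs it; whether the vision projector survives this conversion step is exactly what this issue is about.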

Thanks

Relevant log output


OS

macOS

GPU

M4

CPU

M4

Ollama version

0.11.11 and 0.10.1

GiteaMirror added the bug and needs more info labels 2026-04-29 06:59:14 -05:00
Author
Owner

@pdevine commented on GitHub (Sep 17, 2025):

Ollama doesn't support the audio part of gemma3n (at least not yet). Can you include a link to the model you're trying to convert? Is there a reason not to use the model directly from ollama?

Author
Owner

@Jeremy-Developer-Page commented on GitHub (Sep 17, 2025):

Hi, the model is https://huggingface.co/google/gemma-3n-E4B-it, and as you can see its multimodal support covers image, audio, and text. Audio isn't supported right now, and that's fine, but text and images are things Ollama does support. I'm converting manually because the model in the Ollama library isn't the -it version and is text-only. What I mean is: is it possible to convert just the text and vision parts of the model, like the other multimodal models that are available?

If you want to verify and you have a Mac, try converting this model with mlx-vlm: you'll notice that that conversion preserves the vision capabilities, so the model can analyze images. The limitation is in Ollama, which doesn't correctly pick up the model's multimodal functions during conversion.

Thanks

Author
Owner

@Jeremy-Developer-Page commented on GitHub (Sep 17, 2025):

But you can reproduce the same issue with the base model (non-it): https://huggingface.co/google/gemma-3n-E4B. In fact, if you look at the Ollama library, it's recognized as text-only. (I spent a day testing before writing this issue.) So please consider analyzing the image structure and supporting it in Ollama. This is also a very fast model on any device, so if Ollama supported its image recognition it would literally be a game changer for many users, even those with lower-end computers.

I remain available in case of further questions, tests, or clarifications.
Thank you.
Regards

Jeremy

Author
Owner

@pdevine commented on GitHub (Sep 17, 2025):

@Jeremy-Developer-Page Ah crap, I just realized we never did the ImageNet implementation for this model (it came out around the same time as a bunch of other models we were working on). We did do the image part for gemma3 which uses SigLip (this model essentially combined paligemma and gemma2), but ImageNet is a pretty different architecture.

I'm going to close this as a dupe of #10792 since we were already tracking it. Really sorry about the confusion.

Reference: github-starred/ollama#54699