[GH-ISSUE #10792] Gemma 3n #32845

Open
opened 2026-04-22 14:42:46 -05:00 by GiteaMirror · 39 comments

Originally created by @diegovalenzuelaiturra on GitHub (May 21, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10792

Originally assigned to: @mxyng on GitHub.

Would it be possible to support Gemma 3n?

GiteaMirror added the model label 2026-04-22 14:42:46 -05:00

@groundstation1 commented on GitHub (May 27, 2025):

made llama.cpp request on their idea board: https://github.com/ggml-org/llama.cpp/discussions/13831
(hope that's the right place. That's where the bulk of the work has to be done, right?)

@chriszs commented on GitHub (Jun 11, 2025):

When asked about whether the current LiteRT (formerly TensorFlow) weights would ever be released in GGUF, one Google developer relations person [said](https://huggingface.co/google/gemma-3n-E4B-it-litert-preview/discussions/11#68344aef0c0aff775f05b1b4), "Yes, this repository is just a preview. Stay tuned!" More recently, another [said](https://x.com/_philschmid/status/1932333817789870329), "We are working hard on making Gemma 3n 4B available in popular open source frameworks! Stay tuned." So, I guess we should "stay tuned."

@rick-github commented on GitHub (Jun 26, 2025):

https://ollama.com/library/gemma3n. Requires [0.9.3](https://github.com/ollama/ollama/releases/tag/v0.9.3). Note that this version is not multi-modal, text generation only.

@feynon commented on GitHub (Jun 26, 2025):

Do we expect multimodal generation soon? Is it blocked on downstream llama.cpp in any way?

@The-Best-Codes commented on GitHub (Jun 26, 2025):

Would love to see support now that Gemma 3n is stable:
https://huggingface.co/collections/google/gemma-3n-685065323f5984ef315c93f4

There are also community GGUFs:
https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF

@mxyng commented on GitHub (Jun 26, 2025):

Vision is a work in progress

@jonigl commented on GitHub (Jun 26, 2025):

> https://ollama.com/library/gemma3n. Requires [0.9.3](https://github.com/ollama/ollama/releases/tag/v0.9.3). Note that this version is not multi-modal, text generation only.

Would love to see Tools support added to this model! Are there any plans to support it?

@rick-github commented on GitHub (Jun 26, 2025):

The model doesn't support tool use. You can hack it in like was done for [gemma3](https://ollama.com/search?q=gemma3+tool), but the quality of response may be poor.

@rick-github commented on GitHub (Jun 26, 2025):

As I suspected, it's not a great tool user, but it actually did better than I expected. One thing I noticed is that it gets noticeably slower as the context increases, so perhaps not good for multi-turn tool calls.

````dockerfile
FROM gemma3n:latest
TEMPLATE """{{- if or .System .Tools }}<start_of_turn>user
{{ .System }}
{{- if .Tools }}

You can use these tools to help answer the user's question:
{{- range .Tools }}
{{ . }}
{{- end }}

When you need to use a tool, format your response as JSON as follows:
```tool
{"name": "tool_name", "parameters": {"param1": "value1", "param2": "value2"}}
```
{{- end }}<end_of_turn>
<start_of_turn>model
I understand.
<end_of_turn>
{{- end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 }}
{{- if eq .Role "user" }}<start_of_turn>user
{{ .Content }}<end_of_turn>
{{ if $last }}<start_of_turn>model
{{ end }}
{{- else if eq .Role "assistant" }}<start_of_turn>model
{{- if .ToolCalls }}
```tool
{{- range .ToolCalls }}
{"name": "{{ .Function.Name }}", "parameters": {{ .Function.Arguments }}}
{{- end }}
```
{{- else }}
{{ .Content }}
{{- end }}{{ if not $last }}<end_of_turn>
{{ end }}
{{- else if eq .Role "tool" }}<start_of_turn>user
<tool_response>
{{ .Content }}
</tool_response><end_of_turn>
{{ if $last }}<start_of_turn>model
{{ end }}
{{- end }}
{{- end }}"""
$ ./ollama-tool.py --model gemma3n-tools --tool all --test
>>> what time is it?
calling get_datetime({'timezone_name': None})
The current time is Thursday, June 26, 2025 at 22:27.
>>> what is 10 ^ 1.34?
calling power({'x': '10', 'y': '1.34'})
21.87761623949553
>>> what operating system are you running on?
calling get_operating_system({})
I am running on Linux-6.8.0-47-generic-x86_64-with-glibc2.39.
>>> what is my ip address?
calling add({})
calling add({})
calling add({})
calling add({})
calling add({})
You are right to point that out. It seems I am stuck in a loop. I am unable to determine your IP address with the tools available. I am a language model and do not have the ability to directly access your IP address. I apologize for the repeated error.

To find your IP address, you would typically:

*   **Search on Google:** Simply search "what is my ip" on Google.
*   **Use a website:** Visit a website like [https://www.whatismyip.com/](https://www.whatismyip.com/)

I am limited in what I can do, and I am still under development. I appreciate you bringing this to my attention.




>>> what is my location?
calling get_location({})
The location is:

*   **IP Address:** 
*   **Latitude:** 
*   **Longitude:** 
*   **Country:** Switzerland
*   **Region:** Zurich
*   **City:** Zurich

This information is based on the IP address lookup provided in the HTML response.
>>> What's the weather where I am?
calling get_weather({'city': 'Zurich', 'country': 'Switzerland', 'unit': 'unit'})
The weather in Zurich, Switzerland is mild. The current temperature is 20°C, but it feels like 20°C. The forecast for today is 29/15°C with wind from the South at 9 km/h. The humidity is 58%.
```
@quaintdev commented on GitHub (Jul 8, 2025):

Any update on this? Gemma 3n on mobile is fantastic. See below.

![Image](https://github.com/user-attachments/assets/33072912-fc8c-4194-bc3a-36c85a476137)

@KopfKrieg commented on GitHub (Jul 8, 2025):

What App is that 👀

@quaintdev commented on GitHub (Jul 8, 2025):

It's Google's AI Edge Gallery app. Check it out https://github.com/google-ai-edge/gallery

@barnabywalters commented on GitHub (Jul 15, 2025):

Are there any plans/roadmap to get gemma3n’s audio input capabilities working with ollama?

@rick-github commented on GitHub (Jul 15, 2025):

Audio is a different modality to those currently supported by ollama. The developers have indicated adding other types of data input is on the roadmap. Whether that includes gemma3n remains to be seen.

@jy1989 commented on GitHub (Jul 17, 2025):

I think it should be marked on the official website that multimodality is not supported. There is no error when identifying images, and I thought there was a bug with my program. 😂

@rick-github commented on GitHub (Jul 17, 2025):

The model as listed in the ollama library does not have the `vision` tag. When the model supports vision, the tag will be added.

@luewolf commented on GitHub (Jul 17, 2025):

One of the main selling points of Gemma3n for me is the selective parameter activation technology to reduce resource requirements. Does Ollama support this feature yet?

@quaintdev commented on GitHub (Jul 18, 2025):

Google has created its own runtime for models like Gemma3n. It's called LiteRT-LM and you can use it on desktop. I raised an issue with them regarding the vision modality; they said it will be supported by the end of this month.

https://github.com/google-ai-edge/LiteRT-LM

@QtRoS commented on GitHub (Jul 18, 2025):

+1 to @luewolf

It's said that:

> A single 4B model natively includes a 2B submodel, allowing you to dynamically trade off performance and quality on the fly

> While their raw parameter count is 5B and 8B respectively, architectural innovations allow them to run with a memory footprint comparable to traditional 2B and 4B models, operating with as little as 2GB (E2B) and 3GB (E4B) of memory

Is there any chance that these options will be supported by Ollama?
Sources: [one](https://www.kaggle.com/competitions/google-gemma-3n-hackathon), [two](https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/)
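
As a back-of-envelope sketch of what those quoted figures imply for resident weight memory (the bytes-per-parameter values below are assumptions, and KV cache, activations and runtime overhead are ignored):

```go
// Back-of-envelope only: rough resident-weight sizes implied by the numbers
// quoted above. FP16 = 2 bytes/param and 8-bit = 1 byte/param are assumptions.
package main

import "fmt"

// gigabytes converts a parameter count and per-parameter byte size to GB.
func gigabytes(params, bytesPerParam float64) float64 {
	return params * bytesPerParam / 1e9
}

func main() {
	const rawParams = 5e9      // total parameters of Gemma 3n E2B
	const residentParams = 2e9 // params kept on the accelerator once PLE is offloaded

	fmt.Printf("all 5B resident at FP16:    ~%.0f GB\n", gigabytes(rawParams, 2))      // ~10 GB
	fmt.Printf("2B resident at FP16:        ~%.0f GB\n", gigabytes(residentParams, 2)) // ~4 GB
	fmt.Printf("2B resident, 8-bit weights: ~%.0f GB\n", gigabytes(residentParams, 1)) // ~2 GB
}
```

The last line only matches the "as little as 2GB" figure if the resident weights are also quantized, so treat it as an estimate rather than a statement about how the official builds are packaged.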

@Larry-Gan commented on GitHub (Jul 25, 2025):

Dang kinda sucks that nobody supports gemma 3n multimodal anywhere lol, even google doesn't have it on ai studio...

@bmachek commented on GitHub (Jul 25, 2025):

> Dang kinda sucks that nobody supports gemma 3n multimodal anywhere lol, even google doesn't have it on ai studio...

I think LM Studio already got it done.

@The-Best-Codes commented on GitHub (Jul 25, 2025):

@Larry-Gan @bmachek

The model is here on Ollama:
https://ollama.com/library/gemma3n

It works with llama.cpp as well.

@bmachek commented on GitHub (Jul 25, 2025):

> @Larry-Gan @bmachek
>
> The model is here on Ollama: https://ollama.com/library/gemma3n
>
> It works with llama.cpp as well.

@The-Best-Codes But multimodal functionality is only there in LM Studio for now.

@QtRoS commented on GitHub (Jul 25, 2025):

> One of the main selling points of Gemma3n for me is the selective parameter activation technology to reduce resource requirements. Does Ollama support this feature yet?

Seems that's only about [GPU memory requirements](https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/):

> With Per-Layer Embeddings, you can use Gemma 3n E2B while only having ~2B parameters loaded in your accelerator.

And it already works in llama.cpp: https://github.com/ggml-org/llama.cpp/pull/14400#issuecomment-3011838558

@Larry-Gan commented on GitHub (Aug 4, 2025):

@bmachek
Are you referring to this LM studio?
https://lmstudio.ai/models/google/gemma-3n-e4b

It's telling me:
> GGUFs are currently text-only. We are working to expand capabilities and remove this limitation.

@bmachek commented on GitHub (Aug 4, 2025):

@Larry-Gan I was using the MLX version with LM studio.

@curious-boy-007 commented on GitHub (Sep 14, 2025):

@diegovalenzuelaiturra @Larry-Gan
I notice there is a GGUF solution for Gemma3n multimodal input:

- https://sdk.nexa.ai/model/Gemma3n-E4B
- https://nexa.ai/blogs/gemma3n

@recrudesce commented on GitHub (Sep 16, 2025):

Gemma3n can be pulled via Ollama from the models library and works fine - I just tested it.

@Jeremy-Developer-Page commented on GitHub (Sep 17, 2025):

@recrudesce

> Gemma3n can be pulled via Ollama from the models library and works fine - I just tested it.

Does it work well for image analysis too? My understanding is that it currently only works properly for text-only prompts.

@Jeremy-Developer-Page commented on GitHub (Sep 17, 2025):

The problem in this case for multimodal support of `gemma3n` and `gemma3n-it` is that the base for the image recognition is ImageNet, which, if I've understood correctly, isn't supported by Ollama at this time. So when the model is converted to GGUF for Ollama it loses its vision capabilities, but when the same model is converted with mlx-vlm it keeps the vision functions. However, MLX is restricted to Mac users only, and I have noticed it is a few tokens/s slower than the version running on Ollama (about 5 on average on a Mac Mini M4). I would say that the key to making it work is:

  1. Understanding how to add the ImageNet functionality to the Ollama engine.
  2. Understanding why, from version 0.11.0 onwards, all multimodal (vision+text) model conversions stop working and lose the vision part.
     (For this second point I would ask @dhiltgen for help, as he probably knows more about it, since I saw that he released 0.11.0. Let me know and correct me if I'm wrong.)

Thanks in advance to all, and good luck solving the problems and helping as many people as possible.

Jeremy

@pdevine commented on GitHub (Sep 18, 2025):

@Jeremy-Developer-Page for ImageNet/MobileNet it would need to be implemented in `/model/models/gemma3n`. I think @mxyng has a partial implementation.

For the second point, it was a little tricky to pin down, but I figured out that `ollama convert` for gemma3 was leaving all of the image weights as BrainFloat16 instead of converting them to FP16. Unfortunately, on Metal the GGML library is missing a BrainFloat16 kernel for the `im2col` function (it's only implemented in FP32 and FP16), so after converting the weights it will blow up. I have a fix for the conversion (#12324) which is now in `main` and should be in the next release.
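
To illustrate the failure mode, here is a minimal, self-contained sketch of the kind of re-encoding the fix performs. It is not Ollama's actual converter code: the function names are made up for illustration, and `f32ToF16` skips proper rounding and subnormal handling.

```go
// Sketch only: re-encode BF16 weight values as FP16 so a backend without a
// BF16 im2col kernel can still run the vision layers.
package main

import (
	"fmt"
	"math"
)

// bf16ToF32 widens a bfloat16 value to float32. bfloat16 is the upper 16
// bits of the IEEE-754 float32 layout, so this step is lossless.
func bf16ToF32(b uint16) float32 {
	return math.Float32frombits(uint32(b) << 16)
}

// f32ToF16 narrows a float32 to IEEE-754 half-precision bits.
// Simplified: truncates the mantissa, clamps overflow to +/-Inf, and
// flushes values too small for a normal half to zero.
func f32ToF16(f float32) uint16 {
	bits := math.Float32bits(f)
	sign := uint16(bits>>16) & 0x8000
	exp := int32(bits>>23&0xff) - 127 + 15
	mant := bits & 0x7fffff

	switch {
	case exp >= 0x1f: // overflow (or Inf/NaN) -> Inf
		return sign | 0x7c00
	case exp <= 0: // too small for a normal half -> zero
		return sign
	default:
		return sign | uint16(exp)<<10 | uint16(mant>>13)
	}
}

func main() {
	// A toy "tensor" of BF16 weights: 1.0, -0.5, ~3.14.
	bf16Weights := []uint16{0x3f80, 0xbf00, 0x4049}
	f16Weights := make([]uint16, len(bf16Weights))
	for i, w := range bf16Weights {
		f16Weights[i] = f32ToF16(bf16ToF32(w))
	}
	fmt.Printf("fp16 bits: %04x\n", f16Weights) // [3c00 b800 4248]
}
```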

@Jeremy-Developer-Page commented on GitHub (Sep 19, 2025):

Okay, so @pdevine and @mxyng, the files to create and configure would be `gemma3n/model_vision.go`, `gemma3n/process_image.go`, and `gemma3n/embed.go`. Am I correct, or is it different, since the gemma3n model is quite different from the standard gemma3? This is the first time I've dealt with these files, which is why I'm asking.

@pdevine commented on GitHub (Sep 19, 2025):

`embed.go` includes the forward pass for generating embeddings for the `embeddinggemma` model. `gemma3n/model_vision.go` would need to be implemented to include the forward pass for MobileNet-V5, which is different from gemma3, which implements SigLIP.

You can find the white paper for gemma3 [here](https://arxiv.org/pdf/2503.19786), and more about MobileNet-V5 [here](https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/#mobilenet-v5:-new-state-of-the-art-vision-encoder).
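
As a rough, hypothetical illustration of why the two encoders share so little code (this is not Ollama's API): MobileNet-style blocks are built from depthwise-separable convolutions, while the SigLIP encoder used by gemma3 is an attention-based ViT with a convolutional patch embedding. Just counting the weights of one convolutional block shows the structural difference:

```go
// Hypothetical illustration (not Ollama code): weight counts for one
// MobileNet-style depthwise-separable block versus a standard kxk
// convolution with the same channel shape.
package main

import "fmt"

// depthwiseSeparableParams: a per-channel kxk (depthwise) conv followed by
// a 1x1 (pointwise) conv that mixes channels, as in MobileNet blocks.
func depthwiseSeparableParams(inC, outC, k int) int {
	depthwise := inC * k * k // one kxk filter per input channel
	pointwise := inC * outC  // 1x1 conv across channels
	return depthwise + pointwise
}

// standardConvParams: a plain kxk convolution over all channel pairs.
func standardConvParams(inC, outC, k int) int {
	return inC * outC * k * k
}

func main() {
	inC, outC, k := 256, 256, 3
	fmt.Println("depthwise-separable block:", depthwiseSeparableParams(inC, outC, k)) // 67840
	fmt.Println("standard 3x3 convolution: ", standardConvParams(inC, outC, k))       // 589824
}
```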

@Jeremy-Developer-Page commented on GitHub (Sep 20, 2025):

I tried to implement that part, but with poor results, since I can't get the program to use the vision model.

@pdevine commented on GitHub (Sep 20, 2025):

@Jeremy-Developer-Page yes it's not an easy implementation. SigLip and MobileNet are quite different from each other.

@Jeremy-Developer-Page commented on GitHub (Sep 23, 2025):

Hi all, is there any news from the other devs?

@Android-Artisan commented on GitHub (Oct 17, 2025):

> Vision is a work in progress

@mxyng any news?

@chllei commented on GitHub (Jan 9, 2026):

Are there any updates regarding this model’s multimodal and tool-calling capabilities?

@reedmayhew18 commented on GitHub (Feb 15, 2026):

Checking in on multi-modal support for both images and audio on this model. Thank you!

If there are any specific areas of this process that you're stuck on and could use help with, please let me/us know.

Reference: github-starred/ollama#32845