[GH-ISSUE #9387] phi4 multimodal and mini instruct support #6125

Open
opened 2026-04-12 17:27:42 -05:00 by GiteaMirror · 36 comments
Owner
Originally created by @olumolu on GitHub (Feb 27, 2025).

Original GitHub issue: https://github.com/ollama/ollama/issues/9387

https://huggingface.co/microsoft/Phi-4-multimodal-instruct
https://huggingface.co/microsoft/Phi-4-mini-instruct
GiteaMirror added the model label 2026-04-12 17:27:42 -05:00

@opencoca commented on GitHub (Feb 27, 2025):

Double plus this request.

@ag2s20150909 commented on GitHub (Feb 28, 2025):

https://github.com/ggml-org/llama.cpp/pull/12099

@temsa commented on GitHub (Mar 1, 2025):

Merged in llama.cpp, see https://github.com/ggml-org/llama.cpp/pull/12108

@DK013 commented on GitHub (Mar 4, 2025):

https://github.com/ggml-org/llama.cpp/pull/12108 includes phi4-mini support. Still waiting for multimodal.

@martin76ec commented on GitHub (Mar 5, 2025):

+1

@alexl4321 commented on GitHub (Mar 5, 2025):

phi-4-multimodal would be absolutely awesome. Everything else, including the latest Qwen 2.5 VL, sadly sucks at real OCR despite what the benchmarks claim.

@torsteinelv commented on GitHub (Mar 5, 2025):

phi4-multimod++++ :)

@EKal-aa commented on GitHub (Mar 7, 2025):

phi4-multimod++++ :)

And is there any chance of running Pixtral in ollama sometime?

@syedautherabbas commented on GitHub (Mar 8, 2025):

phi-4-multimodal +++++

@vlimki commented on GitHub (Mar 8, 2025):

phi-4-multimodal +++++

@poucet commented on GitHub (Mar 8, 2025):

So looking forward to seeing this 👍

@kaykyr commented on GitHub (Mar 8, 2025):

phi-4-multimodal ++++++

@atilladeniz commented on GitHub (Mar 12, 2025):

phi-4-multimodal ++++++

@leedaga commented on GitHub (Mar 13, 2025):

phi-4-multimodal +++++++

@Meshwa428 commented on GitHub (Mar 13, 2025):

phi-4-multimodal +++++++

@Mat5heus commented on GitHub (Mar 13, 2025):

phi-4-multimodal +++++++

@aluferraz commented on GitHub (Mar 13, 2025):

phi-4-multimodal +++++++

@bright8192 commented on GitHub (Mar 16, 2025):

phi-4-multimodal +++++++

@teamgroove commented on GitHub (Mar 17, 2025):

🎅 phi-4-multimodal +++++++

@RobertoRiera commented on GitHub (Mar 21, 2025):

Why is it not available on ollama?

@Meshwa428 commented on GitHub (Mar 22, 2025):

Because it has audio input

@ddaying commented on GitHub (Mar 22, 2025):

phi-4-multimodal +++++++

@RobertoRiera commented on GitHub (Mar 22, 2025):

> Because it has audio input

Do they have to update the platform to support models with audio? Or do they not work with audio, and that's it?

Thanks for the reply!

@Meshwa428 commented on GitHub (Mar 22, 2025):

> > Because it has audio input
>
> Do they have to update the platform to support models with audio? Or do they not work with audio, and that's it?
>
> Thanks for the reply!

Yes, they have to. And they'll have to add audio processing, which is what they don't want to work with, I guess. And audio is just extra bs.

@torsteinelv commented on GitHub (Mar 22, 2025):

What about gemma3? Doesn't it also have audio and video input?

@Meshwa428 commented on GitHub (Mar 22, 2025):

Gemma 3 is just image and video (without audio; they trim it in the background).

Gemma3 is the same as llava.

@superchargez commented on GitHub (Mar 25, 2025):

What makes audio so difficult? I am willing to work on it, though I lack the skill to get it done, but it would be a great learning experience.

@bannert1337 commented on GitHub (Mar 25, 2025):

> What makes audio so difficult? I am willing to work on it, though I lack the skill to get it done, but it would be a great learning experience.

The current architecture is purpose-built for text-to-text inference. Audio is completely different: it would probably need to be fed in as a data stream, and it requires a different approach from the tokenization of text.

This model can process audio directly. That is different from transcribing audio and then using the transcript for text-to-text inference.

https://huggingface.co/microsoft/Phi-4-multimodal-instruct
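
To make the tokenization point concrete, here is a toy sketch of the difference in input shape. It is not how Ollama, llama.cpp, or Phi-4 actually handle either modality; the vocabulary, the function names `tokenize_text` and `frame_audio`, and the frame sizes are all made up for illustration. Text becomes a short list of integer IDs, while audio becomes a long sequence of continuous per-frame features that a separate encoder would have to consume.

```python
import math


def tokenize_text(text, vocab):
    """Toy tokenizer: map whitespace-separated words to integer IDs."""
    unk = vocab["<unk>"]
    return [vocab.get(word, unk) for word in text.lower().split()]


def frame_audio(samples, frame_len, hop):
    """Split a waveform into overlapping frames and return the log energy of each."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        features.append(math.log(energy + 1e-10))
    return features


# Hypothetical three-entry vocabulary, for illustration only.
vocab = {"<unk>": 0, "hello": 1, "world": 2}
token_ids = tokenize_text("Hello world again", vocab)

# One second of a synthetic 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
samples = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

# 25 ms frames with a 10 ms hop (assumed values, common in speech front ends).
features = frame_audio(samples, frame_len=400, hop=160)

print(token_ids)       # discrete IDs: [1, 2, 0]
print(len(features))   # continuous features, one float per frame: 98
```

Three words yield three integers, whereas one second of audio already yields 98 floating-point frames, which is one reason a text-only pipeline cannot simply accept audio without extra preprocessing and model components.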

@AlbertoSinigaglia commented on GitHub (Mar 31, 2025):

> > What makes audio so difficult? I am willing to work on it, though I lack the skill to get it done, but it would be a great learning experience.
>
> The current architecture is purpose-built for text-to-text inference. Audio is completely different: it would probably need to be fed in as a data stream, and it requires a different approach from the tokenization of text.
>
> This model can process audio directly. That is different from transcribing audio and then using the transcript for text-to-text inference.
>
> https://huggingface.co/microsoft/Phi-4-multimodal-instruct

Not an expert, but can't we just "shut down" the audio part of the model? I guess it can also work on text only, no?

@Meshwa428 commented on GitHub (Apr 1, 2025):

> Not an expert, but can't we just "shut down" the audio part of the model? I guess it can also work on text only, no?

Then what's the point of multimodal?

@maximilianbehr commented on GitHub (May 14, 2025):

+1

@Sboursen commented on GitHub (May 15, 2025):

+1

@kulogix commented on GitHub (May 18, 2025):

+1

@TomLBZ commented on GitHub (Aug 22, 2025):

phi-4-multimodal +++++++++++++++++

@beardsciences commented on GitHub (Oct 14, 2025):

phi-4-multimodal ++++++

@poucet commented on GitHub (Oct 20, 2025):

Could someone please provide basic instructions on how to get started with this task? I don't mind doing it; I just need a few pointers to get started.

Reference: github-starred/ollama#6125