[GH-ISSUE #3929] Can you please add llava-phi-3-mini by xtuner? #2435

Closed
opened 2026-04-12 12:45:08 -05:00 by GiteaMirror · 5 comments
Owner

Originally created by @yashasnadigsyn on GitHub (Apr 26, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3929

Here is the model gguf link: https://huggingface.co/xtuner/llava-phi-3-mini-gguf

Here is the model hf link: https://huggingface.co/xtuner/llava-phi-3-mini-hf

I have been trying to add it manually with a Modelfile, but I can't seem to understand the template. I tried the llava template, the bakllava template, and other multimodal templates, but the model gets confused.

Can anyone help me?

GiteaMirror added the model label 2026-04-12 12:45:08 -05:00

@TimeLordRaps commented on GitHub (Apr 26, 2024):

I've tried adding it using the split FROM logic added in #1308, and that works.
I've also tried concatenating both files, as was suggested in that thread, and that works too.
Here's the most recent Modelfile I am using:

```
FROM "\path\to\ggml-model-int4.gguf"
FROM "\path\to\mmproj-model-f16.gguf"
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>
"""
PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "<|system|>"
PARAMETER stop "<|end|>"
PARAMETER stop "<|endoftext|>"

PARAMETER num_keep 4
PARAMETER num_ctx 4096
```
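As a sanity check on the TEMPLATE above, it can help to render it by hand and see the exact Phi-3 chat layout the model receives. Here is a minimal Python sketch of that rendering (the `render` helper is hypothetical and for illustration only; Ollama does this internally with Go's text/template, and `.Response` is produced by the model itself, so the rendered prompt ends at `<|assistant|>`):

```python
def render(system=None, prompt=None):
    """Mimic the Modelfile TEMPLATE above for a single turn.

    Reproduces:
      {{ if .System }}<|system|>\n{{ .System }}<|end|>\n{{ end }}
      {{ if .Prompt }}<|user|>\n{{ .Prompt }}<|end|>\n{{ end }}
      <|assistant|>\n
    """
    parts = []
    if system:
        parts.append("<|system|>\n" + system + "<|end|>\n")
    if prompt:
        parts.append("<|user|>\n" + prompt + "<|end|>\n")
    # Generation prefix; the model fills in .Response after this.
    parts.append("<|assistant|>\n")
    return "".join(parts)


print(render("You are a helpful assistant.", "Describe this image."))
```

The stop parameters then cut generation at any of the special tokens so the model does not keep role-playing further turns.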

I will note that OCR is fairly weak compared to what I'd expect from a better projector, so it might be worth waiting, at least going by preliminary tests at Ollama's default temperature.


@TimeLordRaps commented on GitHub (Apr 26, 2024):

Specifically, if I create a model named llava-phi with that Modelfile and run:

```
ollama run llava-phi
```

then in the prompt do:

> Read the text in this image ./image.png

where the image is the one from Ollama's vision model blog announcement (https://ollama.com/public/blog/vision.png).

I get these responses:

  1. The image features a humorous meme about an open-source software project named "Ollama Vision". The main element of the image is a white cartoon cat wearing sunglasses and holding a sign that reads
    "Welcome to the era of open-source multimodal models." This text suggests a playful take on the concept of open-source technology. Below this, there's another line of text that says "Open source is
    great! Now make it funny too!" which further emphasizes the meme's lighthearted message about making software development more enjoyable. The overall image composition and text placement create a clear
    and engaging visual narrative for the viewer.

  2. Ollama Vision
    Welcome to the era of open-source multiplatform models.
    Open your mind to a new way of thinking.
    Open your eyes to see the world.
    Ollama is a free and open-source image recognition API for Python, Java, C++ and JavaScript.
    For more information visit our website: https://github.com/OllamaVision/

So clearly, if someone can optimize the model to reduce OCR hallucinations, we may be closer to something usable at this scale.

Also, just for me, it sucks at writing songs about images as simple as a sunset.


@yashasnadigsyn commented on GitHub (Apr 27, 2024):

Thank you so much. I was an idiot for forgetting the mmproj file. I wasted the whole day yesterday tinkering with just the int4 file. The Modelfile works perfectly. Thank you for it.

Also, for the OCR, I have been able to get somewhat good results with "demanding" prompts.

This was my prompt for the same image as above:

> What are the texts in the image. Your output should only have the texts in the image without any other things.

The output:

> Ollama Vision Welcome to the era of open-source multiplatform models. Open-source models can be made easily now.

But, using the same prompt again:

> What are the texts in the image. Your output should only have the texts in the image without any other things.

I get bad results:

> Ollama Vision Welcome to the era of open-source multiplatform models. Operate web browsers and make code available now.

But it is good at object recognition. Personally, I found it faster and better than llava-7b-v1.6-mistral and bakllava-7b-v1, though I have only tested it with two images.

For OCR especially, I have been looking at this model by InternLM: internlm-xcomposer2-vl-1.8b (https://huggingface.co/internlm/internlm-xcomposer2-vl-1_8b). It has pretty good results.

Again, thank you so much for the Modelfile.
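For anyone scripting these OCR comparisons instead of typing prompts interactively, the same request can go through Ollama's REST API (`POST /api/generate`), which accepts base64-encoded images in an `images` list; setting `temperature` to 0 in `options` should make repeated runs give the same output, which matters given the run-to-run variance seen above. A hedged sketch (the model name `llava-phi` comes from this thread; the helper names are my own):

```python
import base64
import json
from pathlib import Path
from urllib import request


def build_ocr_request(image_path, model="llava-phi"):
    """Build the JSON payload for Ollama's /api/generate endpoint.

    `images` takes base64-encoded image bytes; temperature 0 keeps
    sampling greedy so repeated runs return the same text.
    """
    img_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "model": model,
        "prompt": ("What are the texts in the image. Your output should only "
                   "have the texts in the image without any other things."),
        "images": [img_b64],
        "stream": False,
        "options": {"temperature": 0},
    }


def run_ocr(image_path, host="http://localhost:11434"):
    """Send the request to a locally running Ollama server."""
    payload = json.dumps(build_ocr_request(image_path)).encode()
    req = request.Request(f"{host}/api/generate", data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Running the same image through `run_ocr` a few times would make it easy to tell model hallucination apart from sampling noise.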


@TimeLordRaps commented on GitHub (Apr 27, 2024):

All good. Do you think this can be closed at this point?


@yashasnadigsyn commented on GitHub (Apr 27, 2024):

Yeah, thank you!

Reference: github-starred/ollama#2435