[GH-ISSUE #8853] Problems with Multimodal models #31500

Closed
opened 2026-04-22 11:57:51 -05:00 by GiteaMirror · 12 comments

Originally created by @MLRadfys on GitHub (Feb 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8853

What is the issue?

Hi all!

I am trying to run a multimodal model on a video stream, but noticed that all models seem to struggle.

I am running inference at the frame level and tried simple prompts like "Is there a person in the image? Answer with Yes or No".

Even if the stream does not show any person (in a fairly simple scene), the answer sometimes toggles to Yes. So basically I can have 10 frames in a row answered with No, then suddenly the answer is Yes, and then No again.

In addition, all models seem to have problems following instructions.
The answer is almost never just Yes or No. Sometimes it is "No, there is no person in the image" or "No, the scene shows ...".
For small models like llava-phi3, the output is often very poor and not even task-related ("Answer?", "I cannot help you with that ", "[j/han=}]...").

Does anyone experience similar issues, or is this because of the quantization?

Thanks in advance,

Cheers,

M

OS

Ubuntu

GPU

Rtx4090

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-22 11:57:51 -05:00

@rick-github commented on GitHub (Feb 5, 2025):

How are you feeding the stream to the model?


@MLRadfys commented on GitHub (Feb 5, 2025):

Hi Rick and thanks for your reply!

I am basically using OpenCV to fetch the frames from a camera stream.
I put a single frame in a queue, and inference is then performed on that image:

def run_inference(self):
    while True:
        frame = self.inference_queue.get()
        if frame is not None:
            image_base64 = frame_to_base64(frame)
            response = send_to_llava(image_base64)
            with self.lock:
                self.text_overlay = response  # Update overlay text

My model request looks like this:

import json
import requests

def send_to_llava(image_base64):
    try:
        prompt = "Is there a person in the image? Answer with yes or no."
        url = "http://127.0.0.1:11434/api/generate"
        headers = {"Content-Type": "application/json"}
        data = {
            "model": "minicpm-v",
            "prompt": prompt,
            "images": [image_base64],
            "max_tokens": 100,
            "temperature": 0.7,
            "stream": False
        }

        response = requests.post(url, headers=headers, data=json.dumps(data))
        result = response.json()

        return result
    except requests.RequestException as e:
        print(f"Request to Ollama failed: {e}")
        return None
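
The frame_to_base64 helper isn't shown above; a minimal sketch of one way to implement it, assuming the frames are OpenCV BGR arrays and JPEG encoding is acceptable:

import base64
import cv2  # OpenCV, as used for the camera stream above

def frame_to_base64(frame):
    # Encode the raw OpenCV frame (a BGR numpy array) as JPEG,
    # then base64-encode the bytes, as expected by the "images" field.
    ok, buffer = cv2.imencode(".jpg", frame)
    if not ok:
        raise ValueError("JPEG encoding failed")
    return base64.b64encode(buffer.tobytes()).decode("utf-8")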

@kevin-pw commented on GitHub (Feb 5, 2025):

"temperature": 0.7,

If you want a deterministic "Yes" or "No" response, a temperature setting of 0.0 might avoid the issue you are describing. (The issue being that the response sometimes toggles to "Yes" despite no person being present in the image).

Essentially, the temperature determines the selection of the next token from the probability distribution of predicted next tokens. For example, if the probability of the next token being "No" = 0.9 and "Yes" = 0.1, then a temperature of 0.0 will always select "No" as the next token, but a temperature > 0.0 will result in a "Yes" selection some of the time.
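
Note that in Ollama's /api/generate API, sampling parameters such as temperature and num_predict go inside an "options" object. A sketch of a request pinned to temperature 0 (the send_to_llava_deterministic name is just for illustration):

import requests

def send_to_llava_deterministic(image_base64):
    # Hypothetical variant of send_to_llava above: temperature 0 makes the
    # sampling greedy, so the most likely token ("Yes"/"No") is always picked.
    data = {
        "model": "minicpm-v",
        "prompt": "Is there a person in the image? Answer with yes or no.",
        "images": [image_base64],
        "stream": False,
        "options": {
            "temperature": 0,   # greedy decoding
            "num_predict": 5    # Ollama's option for limiting the reply length
        }
    }
    response = requests.post("http://127.0.0.1:11434/api/generate", json=data)
    return response.json().get("response", "").strip()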

EDIT:
Never mind, I just tested the same prompt Is there a person in the image? Answer with Yes or No. on the models llama3.2-vision:11b and llama3.2-vision:90b. Both models responded with No on several images very clearly containing a person.

Might this be a limitation of the training these models received? I remember reading somewhere that faces were blurred in the llama training process to preserve privacy.


@MLRadfys commented on GitHub (Feb 5, 2025):

Good point Kevin, that's true, thanks!
Nevertheless, I got the impression that the performance is far from the original float32 model.

/M


@kevin-pw commented on GitHub (Feb 5, 2025):

I got the impression that the performance is far from the original float32 model.

I think you are correct @MLRadfys - see the edit in my previous comment. I would expect the vision models to have no problem distinguishing between the presence and absence of a person, but I received several incorrect responses on the few images I briefly tested.

Does anybody know what might be causing the poor performance when detecting a person in an image?


@MLRadfys commented on GitHub (Feb 5, 2025):

That is super weird.
I tested a lot of float32 models directly from Hugging Face, e.g. Qwen2-VL, Phi3.5, Llava, Molmo ... I never had any issues like this; rather, these models were very impressive.
Most of these models are instruction fine-tuned on many different tasks and usually one of them is related to object detection.

So I am not sure where this behavior is coming from. :-(


@MLRadfys commented on GitHub (Feb 6, 2025):

So I just tried the Qwen2-VL model with 2B parameters. No problems at all. Even with a higher temperature value of 0.7, the answer is consistently just "Yes" or "No". No hallucinations or additional output.

I don't really know if it's me doing something wrong in Ollama or if something is wrong with the models.


@rick-github commented on GitHub (Feb 6, 2025):

It looks like you are comparing the performance of safetensor models with the default quants of ollama models. Have you tried using the fp16 quant?


@MLRadfys commented on GitHub (Feb 6, 2025):

Hi Rick and thanks again!

You mean that instead of using e.g. the "llava" tag I should use "llava:7b-v1.6-mistral-fp16"?
Do you know what the default quantization in Ollama is?

Cheers,

M


@rick-github commented on GitHub (Feb 6, 2025):

The default quant used to be q4_0 (the default for llava). For more recent additions to the ollama library, the default quant is q4_K_M. In the ollama library, fp16 is as close as you can get to the original 16/32 bit unquantized model.
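
For example, to try the fp16 variant mentioned above, one could pull it and point the existing request at that tag (a sketch; the send_to_llava_fp16 name is just for illustration, and exact tag names per model are listed on the Ollama library pages):

import requests

# Fetch the fp16 variant first:
#   ollama pull llava:7b-v1.6-mistral-fp16

def send_to_llava_fp16(image_base64):
    # Same request as above, but pointed at the fp16 tag instead of the
    # default (quantized) tag.
    data = {
        "model": "llava:7b-v1.6-mistral-fp16",
        "prompt": "Is there a person in the image? Answer with yes or no.",
        "images": [image_base64],
        "stream": False
    }
    response = requests.post("http://127.0.0.1:11434/api/generate", json=data)
    return response.json().get("response", "").strip()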


@MLRadfys commented on GitHub (Feb 6, 2025):

Awesome, thanks. I will try the fp16 models. Do you have any experience with the q4_0 models? Is it reasonable to expect such a large drop in performance?


@rick-github commented on GitHub (Feb 6, 2025):

It depends on how much precision was encoded into the model weights. I would expect that for vision models this is quite high, so 4-bit quantization will have a quite significant effect. It comes down to a tradeoff between size and accuracy. Vision models, for the most part, aren't very large, so going for q8/fp16 would be a better choice than q4.
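
To check which quantization a pulled model actually uses, the /api/show endpoint (or `ollama show <model>` on the CLI) reports it; a minimal sketch:

import requests

def show_quantization(model_name):
    # /api/show returns model metadata; "details" includes the quantization
    # level, e.g. Q4_0, Q4_K_M or F16.
    response = requests.post(
        "http://127.0.0.1:11434/api/show",
        json={"model": model_name}
    )
    return response.json().get("details", {}).get("quantization_level")

print(show_quantization("llava"))                        # default tag
print(show_quantization("llava:7b-v1.6-mistral-fp16"))   # fp16 tag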

Reference: github-starred/ollama#31500