[GH-ISSUE #2788] Bug: LLaVA 1.6 34b not respecting initial user prompt #1683

Closed
opened 2026-04-12 11:39:23 -05:00 by GiteaMirror · 0 comments
Owner

Originally created by @mobilemike on GitHub (Feb 27, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2788

M2 Max MBP 96GB RAM
Ollama 0.1.27
Compared against llama.cpp CLI @b11a93d (the same commit this Ollama version is built from)

Problem:
When using the Ollama CLI or API with an image, the initial text prompt isn't respected. Examples like the one on the model page typically show prompts like "What is in this picture?". However, when changing the initial prompt to something like "Is this image of a llama?" or "How many animals are in this picture?" or even "Ignore the image and tell me the meaning of life", the output is typically a description of the image.

When using the llama.cpp CLI however, these prompts are followed as expected.

When using chat completions in Ollama, a follow-up question *does* work properly, so after the prompt is initially ignored you can get the expected output on a second attempt.

My suspicion is that this behavior is largely unnoticed, as the default examples are asking for a description and one is being returned. However, this is masking the fact that text prompts used in conjunction with image prompts aren't being properly utilized. As an aside, LM Studio suffers from the same issue.
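The same behavior can be reproduced over the REST API. A minimal sketch of the request body for Ollama's `/api/generate` endpoint, which accepts images as a list of base64-encoded strings alongside the text prompt (the `b"<png bytes>"` placeholder stands in for real image data):

```python
import base64
import json

def generate_request(prompt: str, image_bytes: bytes,
                     model: str = "llava:34b-v1.6-q6_K") -> str:
    """Build the JSON body for Ollama's /api/generate endpoint.

    Images travel as base64 strings in the "images" list next to the
    text prompt; the bug is that the model answers as if the prompt
    were "describe this image" regardless of what is sent here.
    """
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
        "options": {"temperature": 0.2},
    })

# POST this body to http://localhost:11434/api/generate against a
# running Ollama instance (e.g. with curl or urllib.request).
body = generate_request("How many animals are in this picture?", b"<png bytes>")
```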

The below examples use the same image as the one base64 encoded in the above model page CLI example.

Ollama example:

```
❯ ollama run llava:34b-v1.6-q6_K
>>> /set parameter temperature 0.2
Set parameter 'temperature' to '0.2'
>>> How many animals are in this picture? /Users/mike/Downloads/llama.png
Added image '/Users/mike/Downloads/llama.png'
The image you've provided appears to be a cartoon or illustration of an
animal character. It looks like a cute, stylized depiction of a pig with a
happy expression and waving its hand as if saying hello or goodbye. The
art style is simplistic and playful, which is common in many modern
cartoons and emojis.

>>> How many animals are in this picture?
There is only one animal in this picture, which is the cute pig character.
```

llama.cpp example:

```
❯ ./llava-cli -m ../llm-models/cmp-nct/llava-1.6-gguf/ggml-yi-34b-f16-q_5_k.gguf --mmproj ../llm-models/cmp-nct/llava-1.6-gguf/mmproj-llava-34b-f16-q6_k.gguf --image ~/Downloads/llama.png --temp 0.2 -e -p '<|im_start|>system\n<|im_end|><|im_start|>user\n<image>\nHow many animals are in this picture?<|im_end|><|im_start|>assistant\n'
clip_model_load: model name:   vit-large336-custom
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    378
clip_model_load: n_kv:         26
clip_model_load: ftype:        q6_K

...

system_prompt: <|im_start|>system
<|im_end|><|im_start|>user

user_prompt:
How many animals are in this picture?<|im_end|><|im_start|>assistant


There is one animal in this picture, which appears to be a stylized drawing of a pig.

...
```
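Note that the `llava-cli` invocation hand-builds the ChatML-style template used by the Yi-based LLaVA 1.6 34b, placing the `<image>` placeholder (where llama.cpp splices in the image embeddings) inside the same user turn as the text question. A sketch of how that `-p` string is assembled (the `-e` flag expands the `\n` escapes):

```python
def llava16_chatml_prompt(question: str) -> str:
    """Reproduce the ChatML-style prompt passed to llava-cli above.

    <image> marks where llama.cpp inserts the image embeddings; the
    text question follows it within the same user turn, which is why
    llama.cpp honors the question instead of just describing the image.
    """
    return (
        "<|im_start|>system\n<|im_end|>"
        "<|im_start|>user\n<image>\n"
        f"{question}<|im_end|>"
        "<|im_start|>assistant\n"
    )

prompt = llava16_chatml_prompt("How many animals are in this picture?")
```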

Reference: github-starred/ollama#1683