[GH-ISSUE #2429] LLaVA 1.6 Models Unable to Process Specific Image Size and Resolution Locally #27178

Open
opened 2026-04-22 04:13:18 -05:00 by GiteaMirror · 15 comments

Originally created by @jianliao on GitHub (Feb 9, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2429

Environment

  • Version: Ollama v0.1.23
  • LLaVA Models Tested: 13b-1.6 and 34b-1.6
  • Local Machine Specs:
    • GPU: RTX 3080 Ti 12GB
    • CPU: AMD Ryzen 7 5800X
    • Memory: 32GB at 3600MHz

Issue Description

I have encountered an issue where the local versions of the LLaVA 1.6 models (13b and 34b) are unable to process a 1070x150 PNG image. The error message returned is:

The image you've provided is too small and blurry for me to read the text and provide an accurate answer. Could you please try to provide a larger, clearer image or type out the question so I can assist you?

However, when testing the same image on the publicly hosted LLaVA 1.6 instance (https://llava.hliu.cc/), the image is processed without any issues.

Steps to Reproduce

  1. Run either ollama run llava:13b or ollama run llava:34b locally with the mentioned system specifications.
  2. Provide the model with the 1070x150 PNG image (an equivalent scripted check is sketched after this list).
  3. Observe the error message indicating the image is too small and blurry.
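
The same check can also be scripted against the local Ollama HTTP API instead of the interactive CLI. This is a minimal sketch, assuming the server is on the default port, the attached test image is saved locally as A-1.png, and GNU coreutils base64 (the flag differs on macOS):

```sh
# Send the 1070x150 test image to the local LLaVA model via /api/generate.
# The "images" field takes base64-encoded image data.
curl http://localhost:11434/api/generate -d "{
  \"model\": \"llava:13b\",
  \"prompt\": \"What does this image show?\",
  \"images\": [\"$(base64 -w0 A-1.png)\"],
  \"stream\": false
}"
```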

Expected Behavior

The local models should process the image similarly to the publicly hosted version, without returning an error about image size and clarity.

Additional Context

This issue seems to be specific to the local setup with the mentioned specifications. It's unclear if this is a limitation of the local environment or a discrepancy between the local and hosted versions of the model.

Potential Causes

  • Different handling of image inputs between local and hosted versions.
  • Local resource limitations, although the specifications should be more than sufficient.
  • Possible bug in the local implementation of image preprocessing.

Attachments

  • Error message screenshot (if applicable)

    https://github.com/ollama/ollama/assets/1207520/bf461f23-4a05-4fc8-9c39-36c23a098cd0
  • The 1070x150 PNG image (for testing and reproducibility)

    https://github.com/ollama/ollama/assets/1207520/41a5f112-580c-4e58-beb6-e2d807bd95e0

GiteaMirror added the bug label 2026-04-22 04:13:18 -05:00

@easp commented on GitHub (Feb 10, 2024):

Try 0.1.24 and see if it improves anything. There were some fixes for llava 1.6 merged into llama.cpp recently and it looks like they made it into the latest release of Ollama.
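
For anyone following along, after upgrading it is worth confirming the installed version and re-pulling the model tag, since the packaged model blobs can also change between releases. A quick sketch (use whichever llava tag you run):

```sh
ollama --version        # confirm the upgrade took effect
ollama pull llava:13b   # refresh the packaged model in case it was re-published
ollama run llava:13b    # retest with the problem image
```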


@jianliao commented on GitHub (Feb 10, 2024):

> Try 0.1.24 and see if it improves anything. There were some fixes for llava 1.6 merged into llama.cpp recently and it looks like they made it into the latest release of Ollama.

Thank you for the suggestion! I've updated to Ollama v0.1.24 and retested with the same setup and image. Unfortunately, the issue persists and I'm still encountering the same error message regarding image size and clarity. If there are any other potential fixes or workarounds, I'd be eager to hear about them.


@chigkim commented on GitHub (Feb 10, 2024):

Can you guys mark LLaVA 1.6 as partially supported? It's not fully supported in llama.cpp. People assume it works the same as the reference LLaVA 1.6 implementation, and it's not there yet.

https://github.com/ggerganov/llama.cpp/pull/5267

The LLaVA dev is also chiming in there.


@arcaweb-ch commented on GitHub (Feb 13, 2024):

Similar issue confirmed after updating to Ollama v0.1.24 / LLaVA 1.6

Inconsistent OCR Results with LLaVA 1.6 and Ollama vs. Online Demo (haotian-liu/LLaVA#1116): https://github.com/haotian-liu/LLaVA/issues/1116


@bmizerany commented on GitHub (Mar 11, 2024):

This seems to be fixed in the latest release. Using your prompt and image I'm getting:

>>> /Users/bmizerany/Desktop/303718891-41a5f112-580c-4e58-beb6-e2d807bd95e0 (1)
... .png
Added image '/Users/bmizerany/Desktop/303718891-41a5f112-580c-4e58-beb6-e2d807bd95e0 (1).png'
 The image you've provided appears to be a screenshot of a text editor or
word processor with a document that contains a mathematical expression. 
The expression seems to involve some form of integration over a domain \(
\Omega \), but the details are not clear due to the low resolution and 
lack of contrast in the image.

If you have specific questions about this mathematical expression, feel 
free to ask, and I'll do my best to help!

>>> What is the correct answer
 I'm unable to provide the correct answer to this mathematical expression
as it is not clear enough. The image shows a mathematical integral, which
involves a domain \( \Omega \) and a function f(x). To assist you with 
the correct answer, I would need a higher resolution or clearer image of 
the entire expression so that I can read and interpret the details 
accurately. If you have any specific questions about parts of the 
expression, feel free to ask!

Please reopen and update this ticket if you're still having issues.


@jianliao commented on GitHub (Mar 12, 2024):

Hi @bmizerany,

Thanks for looking into this. I've tested with the latest release of Ollama (v0.1.28), and I'm still facing the same issue. It appears that the fix from https://github.com/ggerganov/llama.cpp/pull/5267, although merged, may not have propagated correctly. According to my understanding, two steps are required for the resolution:

  1. The embedded llama.cpp needs to be updated to include the changes from https://github.com/ggerganov/llama.cpp/pull/5267.
  2. The LLaVA 1.6 model provided by Ollama needs to be re-generated following the instructions in its README.md (https://github.com/ggerganov/llama.cpp/blob/master/examples/llava/README.md#llava-16-gguf-conversion); a rough outline of that conversion flow is sketched below.
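
For context, this is roughly the regeneration flow that README describes, reconstructed from memory of the document at the time; the exact script names, flags, and intermediate steps may differ between llama.cpp versions, so treat it as a sketch rather than exact commands:

```sh
# Outline of the LLaVA 1.6 -> GGUF regeneration described in llama.cpp's
# examples/llava README (paths and model directory names are placeholders).

# 1. Split the multimodal projector out of the original LLaVA 1.6 checkpoint.
python examples/llava/llava-surgery-v2.py -C -m ../llava-v1.6-34b/

# 2. Convert the extracted vision encoder + projector into an mmproj GGUF
#    (the README also has you stage the extracted vision-tower files in vit/ first).
python examples/llava/convert-image-encoder-to-gguf.py \
  -m vit --llava-projector vit/llava.projector \
  --output-dir vit --clip-model-is-vision

# 3. Convert the language-model weights to GGUF as usual.
python convert.py ../llava-v1.6-34b/ --skip-unknown
```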

Could you please confirm if these steps have been completed in the latest release? If not, could we reopen the issue to address the persistent problem?

Looking forward to your response.


@bmizerany commented on GitHub (Mar 12, 2024):

@jianliao I've reopened and will look into it further.


@chigkim commented on GitHub (Mar 12, 2024):

https://github.com/ggerganov/llama.cpp/pull/5267 only applies to llava-cli, which Ollama doesn't use.
The fix for the server is in https://github.com/ggerganov/llama.cpp/pull/5553, and it's merged. I'm not sure whether Ollama has picked up an update that includes that PR.
There's also https://github.com/ggerganov/llama.cpp/pull/5896, which fixes a bug that incorrectly reports the prompt token count. That bug doesn't affect the output, it just misreports the count. The PR hasn't been merged into the main branch of llama.cpp because they're refactoring the llama.cpp server in https://github.com/ggerganov/llama.cpp/pull/5882, and they plan to remove multimodal capability from the server and add it back later.
The refactor PR says: "Remove multimodal capabilities - I don't like the existing implementation. Better to completely remove it and implement it properly in the future"
We might lose multimodal capability for a while. Multimodal doesn't get as much love. :(


@chigkim commented on GitHub (Mar 12, 2024):

A couple of weeks ago, I opened #2795 to request incorporating https://github.com/ggerganov/llama.cpp/pull/5553, but it looks like that hasn't been done yet. At least there's no further activity on that issue other than @jmorganca self-assigning it and adding the bug label.


@jianliao commented on GitHub (Mar 13, 2024):

Hi @bmizerany and the team,

I wanted to follow up on the image processing issue we've been discussing. I took the initiative to regenerate the LLaVA-v1.6 GGUF model using the steps outlined in the llama.cpp README.md (https://github.com/ggerganov/llama.cpp/blob/master/examples/llava/README.md). I'm pleased to report that after doing so, I was able to successfully generate the image caption with this newly generated model.

This suggests that the fix from llama.cpp#5267 is effective, and applying these updates could potentially resolve the issue for other users as well. I'd recommend shipping updated models or providing clear documentation on how users can regenerate their models to benefit from the latest fixes.

Here is the inference output:

This image appears to contain a mathematical puzzle or riddle that asks for the value of the expression "3 + 1 / (2 + 3)". The answer provided is "(A) 4", which corresponds to the option D in the list.

The rationale behind this answer lies in the fact that any number multiplied by 0/0 (i.e., undefined or indeterminate) results in a value of 0, as there are no terms to multiply with zero. In this case, the numerator is 2 + 3 = 5 and the denominator is 1, so when you multiply these together, you get 0/0.

However, according to conventional arithmetic, multiplying a number by 0 is 0, not undefined or indeterminate. The correct answer would be to divide both sides of the equation by 2 and then by 3, which gives (5/2) / (5/3) = 3/4.

The riddle seems to be trying to challenge common mathematical concepts and perhaps test the solver's understanding of mathematical rules and principles that may not necessarily apply in every situation or context.

@chigkim commented on GitHub (Mar 14, 2024):

https://github.com/ggerganov/llama.cpp/pull/5267 alone doesn't work for Ollama. Ollama also needs https://github.com/ggerganov/llama.cpp/pull/5553 and https://github.com/ggerganov/llama.cpp/pull/5896 for everything to work properly.


@marksalpeter commented on GitHub (Mar 17, 2024):

Running into the same issue here. Any idea when these fixes will be added to the llava model packaged for Ollama, @jmorganca?


@YanWittmann commented on GitHub (Mar 31, 2024):

Same issue here:

Sorry, it seems like I cannot see any screen capture description from your input. Could you please provide the screen capture description in text format or share a clearer screen capture image for me to analyze?

This was with clearly readable, large text in a 1680x1050 image on the 34b model.


@cjpais commented on GitHub (Apr 4, 2024):

As far as I can tell, the code already has most of the PRs in (assuming it lives in llm/ext_server/server.cpp). From the logs, it looks like the projector files used are an older version.

Is there any documentation on how to create Modelfiles for multimodal models, or generally how the multimodal implementation works at a high level?

I would love to test whether updating the projector alone works, and to learn how to swap projectors in and out easily. I'm also happy to submit a PR to pull in any missing changes from llama.cpp. And if Modelfiles don't support specifying a projector yet, I can help write that with a little guidance on where to make the changes.
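
On the projector-version question: one way to check what a given mmproj file was converted with is to dump its GGUF metadata and look at the CLIP/projector keys. A sketch, assuming the gguf-dump helper shipped with llama.cpp's gguf-py package; its path and flags may differ by version:

```sh
# Print only the metadata (no tensor data) of an mmproj GGUF; the file name
# here is a placeholder for whatever projector file Ollama has stored locally.
python gguf-py/scripts/gguf-dump.py --no-tensors mmproj-model-f16.gguf
```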


@themantalope commented on GitHub (Jun 27, 2024):

Any updates on this? I'm still getting the same issue, using the Docker Compose stack from https://github.com/valiantlynx/ollama-docker.
