[GH-ISSUE #3298] Vision with llava-1.6-7B is unusable via CLI #64067

Closed
opened 2026-05-03 16:03:20 -05:00 by GiteaMirror · 21 comments

Originally created by @olafgeibig on GitHub (Mar 22, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3298

What is the issue?

The image recognition is very poor. It can't describe the picture properly, and it can't extract text either. It seems to process a heavily downscaled image: it complains about the text being too small and makes assumptions about image elements that seem plausible but aren't true. It hallucinates a lot and makes things up.

Basically this model is very capable, but with Ollama it is unusable, and I tried different tags: llava:7b-v1.6-mistral-q5_K_M and llava:7b from the Ollama library. Additionally, I downloaded a Q5_K_M quant of the model from HF and imported it into my local Ollama with a self-created Modelfile. The problem was always the same.

I used this image, a screenshot from a PDF manual:

Screenshot 2024-03-22 at 18 37 08

Prompt:

extract the text underneath the image: /Users/olaf/Desktop/Screenshot\ 2024-03-22\ at\ 18.37.08.png
Added image '/Users/olaf/Desktop/Screenshot 2024-03-22 at 18.37.08.png'

Ollama response:

The image shows a digital advertisement or informational graphic, but it is too small and low resolution to provide a clear transcription of the text. The visible content includes a large purple logo in the upper right corner with the name "SWITCH" below it, and several smaller graphics or logos aligned vertically along the top and bottom edges of the image, which may be related to the product or service advertised on this webpage.

Please note that due to the low resolution and size of the text in the image, it's difficult to provide an exact transcription of what is written.

It did not even try to extract the text; instead it described the image, and the description is complete nonsense.

Converting the image to JPG improved the recognition, but the model still made things up and the text extraction was only partially correct:

The text underneath the image reads:

"Click on any package result to see above. Otherwise the top row will not be visible. A popup will appear and display the packages in the next step.

The first group of buttons is for the product options. The second set of buttons are for different shipping options. The third set ofbuttons allows you to change your email if needed.

Please note that these buttons might not be active on this page but they will be when you click through from a live site."

Using a Python script as described in https://ollama.com/blog/vision-models does not improve anything.
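For context, the pattern from that blog post boils down to POSTing a base64-encoded image to the local Ollama API. A minimal sketch of that pattern, using only the standard library (the image path is hypothetical; assumes a local server on the default port 11434):

```python
import base64
import json
from urllib.request import Request, urlopen

def encode_image(path: str) -> str:
    """Read an image file and return it base64-encoded, as /api/chat expects."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def extract_text(image_path: str, model: str = "llava:7b",
                 host: str = "http://localhost:11434") -> str:
    """POST one prompt plus one image to the local Ollama chat endpoint."""
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": "extract the text underneath the image",
            "images": [encode_image(image_path)],
        }],
        "stream": False,  # return one JSON object instead of a token stream
    }
    req = Request(f"{host}/api/chat",
                  data=json.dumps(payload).encode("utf-8"),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

if __name__ == "__main__":
    print(extract_text("screenshot.png"))  # hypothetical image path
```

Whether the image is passed this way or via the CLI, the server receives the same base64 payload, so the poor results above are not an artifact of the CLI path handling.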

Log output after script execution with the png file:

time=2024-03-22T20:03:14.404+01:00 level=INFO source=dyn_ext_server.go:172 msg="loaded 1 images"
{"function":"launch_slot_with_data","level":"INFO","line":833,"msg":"slot is processing task","slot_id":0,"task_id":637,"tid":"0x172f5f000","timestamp":1711134194}
{"function":"update_slots","level":"INFO","line":1852,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":637,"tid":"0x172f5f000","timestamp":1711134194}
encode_image_with_clip: image embedding created: 576 tokens

encode_image_with_clip: image encoded in   658.77 ms by CLIP (    1.14 ms per image patch)
{"function":"print_timings","level":"INFO","line":264,"msg":"prompt eval time     =    6374.43 ms /     1 tokens ( 6374.43 ms per token,     0.16 tokens per second)","n_prompt_tokens_processed":1,"n_tokens_second":0.15687684555803996,"slot_id":0,"t_prompt_processing":6374.427,"t_token":6374.427,"task_id":637,"tid":"0x172f5f000","timestamp":1711134212}
{"function":"print_timings","level":"INFO","line":278,"msg":"generation eval time =   11258.49 ms /   321 runs   (   35.07 ms per token,    28.51 tokens per second)","n_decoded":321,"n_tokens_second":28.511811350934032,"slot_id":0,"t_token":35.073183800623056,"t_token_generation":11258.492,"task_id":637,"tid":"0x172f5f000","timestamp":1711134212}
{"function":"print_timings","level":"INFO","line":287,"msg":"          total time =   17632.92 ms","slot_id":0,"t_prompt_processing":6374.427,"t_token_generation":11258.492,"t_total":17632.919,"task_id":637,"tid":"0x172f5f000","timestamp":1711134212}
{"function":"update_slots","level":"INFO","line":1660,"msg":"slot released","n_cache_tokens":322,"n_ctx":2048,"n_past":907,"n_system_tokens":0,"slot_id":0,"task_id":637,"tid":"0x172f5f000","timestamp":1711134212,"truncated":false}
[GIN] 2024/03/22 - 20:03:32 | 200 | 17.686484875s |       127.0.0.1 | POST     "/api/chat"

What did you expect to see?

I compared it with the results from this HF Space running the same model in FP16, but most probably with PyTorch and Transformers: https://huggingface.co/spaces/merve/llava-next

prompt:

extract the text underneath the image

response:

The text underneath the image is as follows:

"Click on any package results in additional information popping up, as seen above. Otherwise, the top row of text buttons represents view filters for seeing and sorting packs differently. The first group of buttons offers to filter packages by their source, either by showing only those by Bitwig, only those by Artists, or only those from Partners (like sound design companies, etc.). Or simply turn off this filter to see all packages

This is a good answer. It is exactly the text that appears beneath the picture in the screenshot.

Steps to reproduce

  1. ollama run llava:7b
  2. Use the uploaded picture with the prompt from above

Are there any recent changes that introduced the issue?

No response

OS

macOS

Architecture

arm64

Platform

No response

Ollama version

0.1.29

GPU

Apple

GPU info

M1 Pro

CPU

Apple

Other software

No response

GiteaMirror added the bug label 2026-05-03 16:03:20 -05:00

@MorrisLu-Taipei commented on GitHub (Mar 23, 2024):

No surprise, llava is always poor at image recognition. This is not an Ollama issue; you may have to ask the llava team.


@olafgeibig commented on GitHub (Mar 23, 2024):

I disagree. The model performs very well in that regard, better than any other model <= 7B that I tested. Please read the "What did you expect to see?" section. The same model served in an HF Space does near-perfect OCR and description of the image.


@igorschlum commented on GitHub (Mar 23, 2024):

Hi @olafgeibig

There are 3 llava models; the biggest is the best. I cropped the image you provided to keep only the part with the text in black and white and asked llava: can you give me the text written on this image? /Users/igor/Documents/readme
... .png

The result is:

The image contains text that appears to be instructions for filtering
packages or content, possibly on a website or software platform. Here is
the transcribed text:

"Click on any package result is additional information popping up as seen
above otherwise the top row of text buttons represents view filters for
seeing and sorting packs differently. The first group of buttons offers to
filter packages by their source such as Only By Blizz, only by Artists, or
simply turn off this page. The second group of buttons offers to filter
package by their status installed/uninstalled meaning if they are
available on your computer. Finally, the third group offers sorting
options one option is to sort packages alphabetically with the Name 1
button Or choose to sort packages based on their release date with the
Recommended button."

When I used the same picture and the same prompt with the llava:13b model, I got this answer:

The text in the image appears to be a bit blurry and stretched due to the
perspective, making it difficult to read accurately. However, from what is
visible, it seems to be a description or instructions related to sorting
or filtering packages on a website or application. If you provide a
clearer image or more details about the text, I could try to transcribe it
for you.

I also used an old llava 1.3 from 3 months ago (a 7B model, and it was working), but the recognition was not as good in 1.3 as in 1.6, so the result was:

The first group of buttons offers filters for package results, additional
information popping up as seen above any row of text buttons representing
various views. Filters are available from the source by showing on either
BittyWig (artists), a like sound design company, or simply by using
Partner's like sound design. They have also included simple options for
sorting packages alphabetically with their name button and choosing to
sort by date when the release is announced, which will be displayed on
this page.


@marksalpeter commented on GitHub (Mar 24, 2024):

I have the same issue. The Hugging Face deployments of 1.6 are able to read text out of most of the document images I've given them. When I give the same images to Ollama, it is unable to read them. I'm on an Intel Mac though.


@igorschlum commented on GitHub (Mar 24, 2024):

@marksalpeter Which version of the model do you use? It seems like the llava:34b-v1.6 works better. I'm trying to install this version from Hugging Face: https://huggingface.co/llava-hf/llava-v1.6-34b-hf but have to spend some time doing that.


@olafgeibig commented on GitHub (Mar 24, 2024):

My issue is with the 7B model; I can't run a 34B model. Anyway, the point is that there must be a bug in Ollama or probably in the underlying llama.cpp.


@igorschlum commented on GitHub (Mar 25, 2024):

@olafgeibig On Hugging Face, did you test with the 7B model? The 34B model works fine with Ollama, so it could be an issue with the 7B model itself.


@olafgeibig commented on GitHub (Mar 25, 2024):

Yep, the HF Space where they demo the new Transformers feature for using Llava models runs the 7B model. See for yourself: https://huggingface.co/spaces/merve/llava-next/blob/main/app.py


@olafgeibig commented on GitHub (Apr 3, 2024):

bump


@CoderCowMoo commented on GitHub (Apr 6, 2024):

Can confirm (see the two attached screenshots). And the HF Space says that the llava 1.6 Mistral 7B variant is being used:

Try LLaVA NeXT in this demo (more specifically, the Mistral-7B variant).


@igorschlum commented on GitHub (Apr 18, 2024):

I ran the test again with ollama run llava:13b on version 0.1.32,
and I got this answer, which is quite good, since the AI always rewrites the text it recognizes through OCR.

Sure, the text written in the image is as follows:

"Click on any package results is add all information pop-up to see the top
row of text button represents view showing only packages that are
different from their source.

The first group of buttons offers filters by their source - either by the
artists or not those by partner companies (e.g. "Bandcamp" or
"SoundCloud"). The second group of buttons offers filters by their status

  • either available for download but are A+, meaning they've been
    pre-installed and can be used simply by clicking on the download link, or
    again, simply turn off this feature to see all packages in one place.

Finally, the third group of buttons offers sort options. One option is to
sort by name alphabetically with the "Name" button. The other is to choose
a sort based on their release date with the "Release Date" button.
Finally, simply turn off this feature to see all packages in one place."


@igorschlum commented on GitHub (Apr 19, 2024):

I loaded llava 7b with version 0.1.32 and got a good result with this image:

ollama run llava:7b

can you give me the full text of this image? /Users/igor/Documents/readme.png

Added image '/Users/igor/Documents/readme.png'
Sure, here's the full text from the image:

click on any package result is additonal information popping up as seen 
above otherwise the top row of buttons are different filters for seeing 
all products by sorting them by price low to high or showing only items 
that have been installed either by me or someone else. The first group of 
buttons offers filter packages by their source. These are not shown but 
they are there if you need to see them.

Below this is the second group of buttons which offer different types of 
packages. You can use a keyboard shortcut to navigate through these 
buttons. However, the package installer is disabled for now. There may be 
some hidden options that you cannot find. The package installer does not 
work without being installed but are available if you simply turn them on.
This is an important message to ensure that users are aware of all 
available packages in order to sort their preferences.

The first set of buttons allows a user to see the name and description of 
each package as well as any buttons that might be associated with it. The 
second row of buttons is disabled for now but simply turning them on will 
allow you to see all packages that are not installed or simply turn off 
this option if you wish to disable it.

The last set of buttons allows you to choose the third group of options 
which are not fully visible in this image and do not have any associated 
keyboard shortcuts. These options are also not fully visible in this 
image. It is important to note that all packages can be sorted into 
different categories based on their function or the number of packages 
they belong to. The final set of buttons allows you to choose your 
preferences. This is an important message to ensure that users are aware 
of all available packages in order to sort their preferences.

The result is not exactly the same as the image, but I think that we can close this issue if @olafgeibig agrees.


@hx507 commented on GitHub (May 3, 2024):

Also seeing the same issue, where llava from Ollama performs significantly worse than the web-hosted versions.

> I loaded llava 7b with version 0.1.32 and I get a good result with this image:

Trying the exact same thing but the result seems very bad with latest ollama.

>>> can you give me the full text of this image ? /home/xxx/Desktop/readme.png 
Added image '/home/xxx/Desktop/readme.png'
 The image contains a humorous meme with a visual pun. Here's the text in a semi-transparent format:

"DON'T PAY ANY ATTENTION TO THE DISTINCT LACK OF LOGICAL SEQUENCES OR THE CLEAR LACK OF COHERENCE BECAUSE THIS IS A MEME" 

I was able to give the Ollama 34b a try. However, the 34b just keeps complaining that the image is blurry.

>>> can you give me the full text of this image ? /home/xxx/Desktop/readme.png
Added image '/home/xxx/Desktop/readme.png'
The image appears to contain a series of text bubbles, likely from an online platform or messaging service, with the word "DUPED" in each one. 
The text is not fully visible due to the angle and resolution of the image, but it seems to be repetitive, possibly indicating a pattern or 
trend being discussed within those messages.


@hx507 commented on GitHub (May 3, 2024):

Interestingly, restarting ollama server makes the first image query work.

For anything other than the first image query uploaded (even with a fresh client session), the model will just output nonsense or complain about it being too blurry.


@jacopofar commented on GitHub (May 8, 2024):

I had the same issue here. Running Ollama through Docker and then

docker exec -it ollama ollama run llava:7b

I get an interactive prompt, where I tried these commands:

  • Describe this image
  • Describe this image <path to a real image on a mounted docker volume>
  • Describe this image <invalid path>

in all cases it happily hallucinates and describes an image that does not exist; in cases 2 and 3 it outputs "Added image <path>" first but gives no warning that the file does not exist.

I restarted ollama server and this time it described the image correctly.

Still, if I pass a path that does not exist it just invents a description.

I did more tests, and it seems the first image I try to describe after a server restart (not just a new session, but restarting the Ollama server itself) is reliable. After that it may or may not work: sometimes it keeps "seeing" the previous image, and sometimes it sees nothing at all and generates a random description. I could not find out what creates this broken state.
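One quick way to check for this stale-state behaviour is to send two genuinely different images in a row and compare the answers: identical responses for different images suggest the second request reused the first image's embedding. A rough standard-library sketch against the local /api/chat endpoint (the image paths are hypothetical, not the commenter's setup):

```python
import base64
import json
from urllib.request import Request, urlopen

def ask(path: str, host: str = "http://localhost:11434") -> str:
    """Send one image to /api/chat and return the model's reply."""
    with open(path, "rb") as f:
        img = base64.b64encode(f.read()).decode("utf-8")
    payload = {
        "model": "llava:7b",
        "messages": [{"role": "user",
                      "content": "Describe this image in one sentence.",
                      "images": [img]}],
        "stream": False,
    }
    req = Request(f"{host}/api/chat",
                  data=json.dumps(payload).encode("utf-8"),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

def looks_stale(reply_a: str, reply_b: str) -> bool:
    """Crude heuristic: identical answers for two different images hint
    that the server reused the first image's embedding for the second."""
    return reply_a.strip() == reply_b.strip()

if __name__ == "__main__":
    a = ask("/tmp/first.png")   # hypothetical test images
    b = ask("/tmp/second.png")
    print("possible stale state:", looks_stale(a, b))
```

In practice the broken state described above may also produce a different but unrelated answer rather than a verbatim repeat, so a manual read of both replies is still the more reliable check.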


@hx507 commented on GitHub (May 10, 2024):

Looking at the release note of 0.1.34 I think this is already addressed:

  • Fixed issues with LLaVa models where they would respond incorrectly after the first request

It seems fixed for me on llava:34b.

@jmorganca commented on GitHub (May 10, 2024):

Hi folks 0.1.34 should fix these issues. Sorry this happened. Thanks for creating an issue 😊

@AverageGuy commented on GitHub (Jun 13, 2024):

I have the same problem on 0.1.43, using llava:13b-v1.6.
The first time I ran it, it said there was no text. The second time it said "Here's a cool picture of my cat". The third time it returned:

```json
{
  "text": "JOIN THE GLOBAL MOVEMENT FOR A BETTER FUTURE",
  "font_size": "28px",
  "font_family": "Arial Black",
  "background_color": "transparent",
  "text_color": "#ffffff"
}
```

The fourth time:

```json
{
  "text": "AI-generated art can be an interesting way to explore creativity with technology. It allows us to push boundaries and see what is possible, even if it may not always be perfect or conventional."
}
```
The image is a 2000x191 pixel grey image of a set of transactions from a CC statement. Here's the code:
```python
import ollama
import base64

def get_base64_encoded_image(image_path):
    with open(image_path, "rb") as img_file:
        return base64.b64encode(img_file.read()).decode('utf-8')

b64 = get_base64_encoded_image("/home/jwl/3LineTrans.png")

# Initialize the Ollama client
ollama_client = ollama.Client()

# Define the path to your image
image_path = '/home/jwl/3LineTrans.png'

# Prepare the message to send to the LLaVA model
message = {
    'role': 'user',
    'content': 'Print the text found in this image in json format.',
    'image': {
        'type': 'base64',
        'media_type': 'image/png',
        'data': b64
    }
}

# Use the ollama.chat function to send the image and retrieve the response
response = ollama_client.chat(
    # model="llava:latest",  # Specify the desired LLaVA model size
    model="llava:13b-v1.6",  # Specify the desired LLaVA model size
    messages=[message]
)

# Print the model's description of the image
print(response)
print(response['message']['content'])
```
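One possible cause of the random answers in the snippet above: the ollama Python client documents multimodal messages via an `images` list on the message, not a nested `image` dict, so the dict shown above may be silently ignored and the model never sees the picture. A sketch of the documented shape (the prompt, model tag, and path are taken from the comment above; the actual call is commented out since it needs a running server):

```python
def build_image_message(prompt, image_path):
    # Per the ollama Python client docs, `images` accepts file paths,
    # raw bytes, or base64-encoded strings.
    return {
        'role': 'user',
        'content': prompt,
        'images': [image_path],
    }

msg = build_image_message(
    'Print the text found in this image in json format.',
    '/home/jwl/3LineTrans.png',
)

# import ollama
# response = ollama.Client().chat(model='llava:13b-v1.6', messages=[msg])
# print(response['message']['content'])
```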

@AverageGuy commented on GitHub (Jun 13, 2024):

I don't know what I did to mess up the formatting. I used the <> button, which inserted a couple of backticks, and I pasted the text between them. It doesn't seem to like comments.

@samyan commented on GitHub (Aug 12, 2024):

Any update on this for llava 8b?

@Falkensmaze0 commented on GitHub (Oct 11, 2024):

@olafgeibig On hugging face, did you test with the 7B model? The 34B model works fine with Ollama, so it could be an issue with the 7B model itself.

Currently using the Llava:34b-v1.6-fp16 with Ollama, and I've had practically zero luck with any type of image reading. The infamous "picture is blurry or too small" has been a constant, unfortunately.

I've tried calling the ollama.chat() from ollama's python library, similar results, except the model complains about the start/stop parameters in the template as well. (template is original)

UPDATE
I recreated the model in Ollama with the temperature set to 0.1. Now it's working with approx. 85% accuracy, though it still occasionally skips lines and changes units.

UPDATE2
Prompting with the phrase "refrain from rephrasing the text, or any alterations whatsoever. report only what is in the image" improved the accuracy even further. I suppose proper image pre-processing coupled with LLM post-processing could achieve near-perfect results.
It might also be worth noting that fine-tuning the model on your subject-matter vocabulary could help if anyone intends to use it on specialized material or academic literature.

OS: Ubuntu 24.04 LTS
CPU: Intel 14900K
GPU: RTX 4090 TUF
RAM: 128GB DDR5
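For reference, the temperature and prompting changes described above can be baked into an Ollama Modelfile (the base model tag and new model name here are examples):

```
FROM llava:34b-v1.6-fp16
PARAMETER temperature 0.1
SYSTEM "Refrain from rephrasing the text, or any alterations whatsoever. Report only what is in the image."
```

Then create and run it with `ollama create llava-ocr -f Modelfile` followed by `ollama run llava-ocr`.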

Reference: github-starred/ollama#64067