[GH-ISSUE #746] Support multi-modal models #62390

Closed
opened 2026-05-03 08:45:32 -05:00 by GiteaMirror · 21 comments

Originally created by @arian81 on GitHub (Oct 10, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/746

Originally assigned to: @pdevine on GitHub.

This is one of the best open-source multi-modal models based on Llama 7B currently. It would be nice to be able to host it in ollama.
https://llava-vl.github.io/

GiteaMirror added the feature request label 2026-05-03 08:45:32 -05:00

@ryansereno commented on GitHub (Oct 10, 2023):

Came here looking for this, to see if discussion surrounding this had begun.
Curious to see what will be required to make this happen.

Edit: Progress is being made [upstream](https://github.com/ggerganov/llama.cpp/pull/3436) in llama.cpp to support this.


@spielhoelle commented on GitHub (Oct 13, 2023):

The PR @ryansereno mentioned is [merged and in master](https://github.com/ggerganov/llama.cpp/tree/master/examples/llava) now. How can we run this in ollama?


@marscod commented on GitHub (Oct 15, 2023):

I could successfully run `llava-v1.5-7b` and it is available at: https://ollama.ai/marscod/llava, but I have to map an `image` parameter to `llama.cpp`'s image parameter. Maybe within the prompt?


@chigkim commented on GitHub (Oct 16, 2023):

It would be good to have a file reader command in the prompt, like `/read file.jpg`, for this.


@hugh-min commented on GitHub (Oct 17, 2023):

> I could successfully run `llava-v1.5-7b` and it is available at: https://ollama.ai/marscod/llava, but I have to map an `image` parameter to `llama.cpp`'s image parameter. Maybe within the prompt?

Could you elaborate on how to map an image within ollama?


@Bortus-AI commented on GitHub (Oct 28, 2023):

> > I could successfully run `llava-v1.5-7b` and it is available at: https://ollama.ai/marscod/llava, but I have to map an `image` parameter to `llama.cpp`'s image parameter. Maybe within the prompt?
>
> Could you elaborate on how to map an image within ollama?

I would like to know as well. Thanks


@tmc commented on GitHub (Oct 29, 2023):

It seems a couple of interface design decisions are at play: 1) how to represent this in the HTTP API, and 2) what the user/CLI interface should be.

I want to note/highlight that the folks hacking on iTerm2 have done some work that may be relevant in the CLI context here: https://iterm2.com/documentation-images.html

For the HTTP interface, I'd suggest taking some inspiration from how OpenAI folds in image data. I did a bit of protocol decoding, and the TL;DR of how they do it is: upload to a blob store, then include a special message type in the completion message list.

There's also the consideration of whether it's an ollama concern to allow annotation of an incoming image to support highlighting part of it. That feels a bit out of scope to start, but perhaps the design should keep it in mind.
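For concreteness, here is a minimal sketch of that OpenAI-style message shape (the field names follow OpenAI's chat completions vision format as I understand it; treat the details as assumptions, not a proposal for ollama's API):

```python
# Sketch of an OpenAI-style multimodal request payload, for design reference
# only. The inline data URI stands in for the blob-store upload step.
import base64

with open("image.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "gpt-4-vision-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                # The image rides along as a special content part: a URL
                # (possibly to a blob store) or an inline data URI.
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }
    ],
}
```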


@sausheong commented on GitHub (Nov 3, 2023):

> > > I could successfully run `llava-v1.5-7b` and it is available at: https://ollama.ai/marscod/llava, but I have to map an `image` parameter to `llama.cpp`'s image parameter. Maybe within the prompt?
> >
> > Could you elaborate on how to map an image within ollama?
>
> I would like to know as well. Thanks

Me too, can someone explain how to map an image within ollama?


@itsPreto commented on GitHub (Nov 11, 2023):

Love that this is marked as closed but everyone is still clueless over here lol


@orkutmuratyilmaz commented on GitHub (Nov 18, 2023):

@marscod thanks for importing the model. Could you add an example API call on [the model page](https://ollama.ai/marscod/llava)?


@mangiucugna commented on GitHub (Nov 21, 2023):

So I figured out how to use it; here's the code snippet:

with open("image.jpg", "rb") as f:
      encoded_string = base64.b64encode(f.read()).decode('utf-8')
  data = {"model": "marscod/llava", "prompt": f"USER: {encoded_string} {prompt}\nASSISTANT:", }
  try:
    response = requests.post(url="http://127.0.0.1:11434/api/generate", headers={"Content-Type": "application/json"}, json=data, stream=True)
  except Exception as e:
   # manage exception
  output = ""
  for chunk in response.text.split('\n'):
    chunk = json_repair.loads(chunk)
    if isinstance(chunk, dict):
      output += chunk.get("response") or ""

However, it also throws this error: `{"error":"error reading llm response: bufio.Scanner: token too long"}`

For reference, I prefer using llama.cpp directly with bakllava-1 (way more precise), and the syntax there looks like this:

with open("image.jpg", "rb") as f:
      encoded_string = base64.b64encode(f.read()).decode('utf-8')
  image_data = [{"data": encoded_string, "id": 42}]
  data = {"prompt": f"USER:[img-42] {prompt}.\nASSISTANT:", "n_predict": 4000, "image_data": image_data, "stream": True}
  try:
    response = requests.post(url="http://localhost:8080/completion", headers={"Content-Type": "application/json"}, json=data, stream=True)
  except Exception as e:
    # Manage exception
  output = ""
  for chunk in response.iter_content(chunk_size=128):
    content = chunk.decode().strip().split('\n\n')[0]
    try:
        content_split = content.split('data: ')
        if len(content_split) > 1:
            content_json = json_repair.loads(content_split[1])
            output += content_json["content"]
            yield output
    except Exception as e:
       # Manage exception

This is taken from: https://github.com/mangiucugna/local_multimodal_ai

Hope this helps!


@ryansereno commented on GitHub (Nov 21, 2023):

@mangiucugna thank you, will give it a try.
Hadn't heard of Bakllava before, very excited to try it.


@mangiucugna commented on GitHub (Nov 21, 2023):

I imported bakllava-1 locally and did some tests, and it performs so badly compared to the llama.cpp implementation that it is unusable.
I suspect that something is going wrong, that the data arriving at the model is corrupted, and that `{"error":"error reading llm response: bufio.Scanner: token too long"}` is somehow related.

Happy to share my Modelfile and a link to the GGUF for anyone who wants to try to reproduce this.


@Kreijstal commented on GitHub (Dec 9, 2023):

llamafile (https://github.com/Mozilla-Ocho/llamafile) supports llava-1.5; it would be nice if ollama supported it too.


@mak448a commented on GitHub (Dec 15, 2023):

Since this is now added, I can't figure out how to upload an image to the model. When I follow the instructions at https://github.com/jmorganca/ollama/releases/tag/v0.1.15, the model describes something completely different from what was in the picture. I'm on Linux.


@arian81 commented on GitHub (Dec 15, 2023):

> Since this is now added, I can't figure out how to upload an image to the model. When I follow the instructions at https://github.com/jmorganca/ollama/releases/tag/v0.1.15, the model describes something completely different from what was in the picture. I'm on Linux.

You probably haven't updated to the latest version of Ollama if you're getting a bunch of Chinese characters as the output.


@orkutmuratyilmaz commented on GitHub (Dec 16, 2023):

I guess we can consider this issue completed :)


@prologic commented on GitHub (Dec 26, 2023):

When I try this I get:

```
$ ollama run llama2
>>> What's in this image? /Users/prologic/Downloads/IMG_1325.png

I cannot directly view or analyze the image you provided as it is a personal file located on a local computer. However, I can provide some general information about images and how they can be analyzed.
...
```

And I'm using the latest version of ollama:

```
$ ollama --version
ollama version is 0.1.17
```

@pdevine commented on GitHub (Dec 26, 2023):

@prologic llama2 isn't a multimodal model. You should try:

```
$ ollama run llava
```
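There's also the HTTP route; a minimal sketch, assuming a local ollama with the llava model already pulled (the `images` field takes base64-encoded image data per the ollama API docs):

```python
# Send an image to llava through ollama's /api/generate endpoint.
import base64

import requests

with open("image.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={
        "model": "llava",
        "prompt": "What's in this image?",
        "images": [b64],  # raw base64, no data-URI prefix
        "stream": False,  # return one JSON object instead of a stream
    },
)
print(resp.json()["response"])
```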

@prologic commented on GitHub (Dec 26, 2023):

Ahh! Thanks. When I tried to search for [multimodal models](https://ollama.ai/library?q=multimodal), the search turned up empty. This is why I wasn't able to figure this out so easily :/ There should be a way to list and search for multimodal models, even with `ollama search` (does this sub-command exist?)


@schuster-rainer commented on GitHub (Mar 6, 2024):

If you want to use it with LangChain, here is what you need to add to the `HumanMessage`:

```python
from langchain_core.messages import HumanMessage

# `prompt` and `img_base64` are assumed to be defined by the caller.
HumanMessage(
    content=[
        {"type": "text", "text": prompt},
        {
            "type": "image_url",
            "image_url": f"data:image/jpeg;base64,{img_base64}",
        },
    ]
)
```
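Here is a sketch of how that message can then be sent to ollama through LangChain (assuming the `langchain-community` ChatOllama wrapper; import paths vary across LangChain versions, so verify against your install):

```python
# Hypothetical usage sketch: route the multimodal HumanMessage to a local
# llava model via LangChain's ChatOllama wrapper.
import base64

from langchain_community.chat_models import ChatOllama
from langchain_core.messages import HumanMessage

with open("image.jpg", "rb") as f:
    img_base64 = base64.b64encode(f.read()).decode("utf-8")

llm = ChatOllama(model="llava")
message = HumanMessage(
    content=[
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": f"data:image/jpeg;base64,{img_base64}"},
    ]
)
print(llm.invoke([message]).content)
```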