[GH-ISSUE #12227] Add support for agentic models like ui-tars for vision tasks besides captioning and OCR. #70195

Open
opened 2026-05-04 20:38:00 -05:00 by GiteaMirror · 5 comments

Originally created by @SingularityMan on GitHub (Sep 9, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12227

I don't know how hard it would be to implement, but right now vision models are restricted to captioning/OCR; nothing else works for vision-related tasks. I would very much appreciate it if Ollama added support for more agentic vision tasks in the future, like the following:

  • Object detection
  • Caption-to-phrase grounding

And so forth. If you could at least add reliable support for both, that would be enough for me. Beyond that, I'd also be interested in seeing better performance from models like `ui-tars` and `qwen2.5vl` later on.
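
For context, both tasks can at least be *prompted* for through the existing chat API; the open question is whether the coordinates come back accurate. A minimal sketch, assuming the `ollama` Python client and a local `qwen2.5vl` model (the prompt wording and the JSON schema it asks for are illustrative, not a format Ollama enforces):

```python
# Minimal sketch: prompting a vision model for object detection through
# the existing chat API. Assumes the `ollama` Python client and a local
# qwen2.5vl model; the requested JSON schema is illustrative only.
import ollama

response = ollama.chat(
    model="qwen2.5vl",
    messages=[{
        "role": "user",
        "content": (
            "Detect every object in the image and answer with JSON only: "
            '[{"bbox_2d": [x1, y1, x2, y2], "label": "..."}]'
        ),
        "images": ["./picture.png"],
    }],
)
print(response["message"]["content"])
```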

GiteaMirror added the feature request label 2026-05-04 20:38:00 -05:00

@rick-github commented on GitHub (Sep 9, 2025):

> vision models are restricted to captioning/OCR; nothing else works for vision-related tasks

```console
$ ollama run qwen2.5vl "what object is in this picture: ./picture.png"
Added image './picture.png'
The picture shows a small white puppy sitting on a concrete surface. The puppy is wearing a red collar with a bell attached to it.
```

@SingularityMan commented on GitHub (Sep 9, 2025):

> > vision models are restricted to captioning/OCR; nothing else works for vision-related tasks
>
> ```console
> $ ollama run qwen2.5vl "what object is in this picture: ./picture.png"
> Added image './picture.png'
> The picture shows a small white puppy sitting on a concrete surface. The puppy is wearing a red collar with a bell attached to it.
> ```

Yeah, but I'm talking about generating accurate bounding boxes around objects, placing dots on points, and so on.


@rick-github commented on GitHub (Sep 9, 2025):

````console
$ huggingface-cli download --local-dir=. ByteDance-Seed/UI-TARS-1.5-7B
$ ollama create ui-tars-1.5:7b-fp16
$ echo FROM ui-tars-1.5:7b-fp16 > Modelfile
$ ollama show --modelfile qwen2.5vl | grep -v "^FROM" >> Modelfile
$ ollama create ui-tars-1.5:7b-q4_K_M -q q4_K_M
$ ollama run ui-tars-1.5:7b-q4_K_M draw bounding boxes around the objects in this picture: ./picture.png
Added image './picture.png'
```json
[
        {"bbox_2d": [159, 186], "label": "puppy"}
]
```
````

This is not a use case for me so I don't know if this is useful.
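
A quick way to sanity-check a reply like the one above is to parse the JSON out of it and draw the result back onto the image. A minimal sketch, assuming Pillow; it also tolerates a two-value point like the one returned here rather than a full four-value box:

```python
# Parse the JSON out of the model's reply and draw it on the image as a
# visual accuracy check. Assumes Pillow; handles both the two-value
# point returned above and a proper four-value [x1, y1, x2, y2] box.
import json
import re

from PIL import Image, ImageDraw

reply = '[{"bbox_2d": [159, 186], "label": "puppy"}]'  # model output from above

img = Image.open("./picture.png").convert("RGB")
draw = ImageDraw.Draw(img)

match = re.search(r"\[.*\]", reply, re.DOTALL)  # strip any surrounding prose/fences
for det in json.loads(match.group(0)):
    coords = det["bbox_2d"]
    if len(coords) == 4:                       # x1, y1, x2, y2 box
        draw.rectangle(coords, outline="red", width=3)
        draw.text((coords[0], max(coords[1] - 12, 0)), det["label"], fill="red")
    else:                                      # only an x, y point came back
        x, y = coords
        draw.ellipse([x - 4, y - 4, x + 4, y + 4], fill="red")
        draw.text((x + 6, y - 6), det["label"], fill="red")

img.save("annotated.png")
```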


@SingularityMan commented on GitHub (Sep 9, 2025):

> ````console
> $ huggingface-cli download --local-dir=. ByteDance-Seed/UI-TARS-1.5-7B
> $ ollama create ui-tars-1.5:7b-fp16
> $ echo FROM ui-tars-1.5:7b-fp16 > Modelfile
> $ ollama show --modelfile qwen2.5vl | grep -v "^FROM" >> Modelfile
> $ ollama create ui-tars-1.5:7b-q4_K_M -q q4_K_M
> $ ollama run ui-tars-1.5:7b-q4_K_M draw bounding boxes around the objects in this picture: ./picture.png
> Added image './picture.png'
> ```json
> [
>         {"bbox_2d": [159, 186], "label": "puppy"}
> ]
> ```
> ````
>
> This is not a use case for me so I don't know if this is useful.

I have to check the accuracy of that bounding box. Also, it generated a point, not a box; a bounding box is usually generated in the form `[x1, y1, x2, y2]`.

The problem isn't that these models can't generate bounding boxes, it's that they can't generate them accurately. If you run a Python script, parse one of those bounding boxes generated by Ollama, and attempt to click the center of that box (`((x1 + x2) // 2, (y1 + y2) // 2)`) using PyAutoGUI, you'd normally get wildly inaccurate results that are *way off* the position of the element on the screen.

But when you do the same thing in `transformers`, the click is spot on, exactly where you need it, because `transformers` supports this kind of workflow. For my framework, though, I need that in Ollama, not `transformers`. This is the issue I'm talking about. I don't know whether Ollama modifies the image before feeding it to the model, or whether the image needs some kind of pre-/post-processing before interacting with the UI element, but my results in Ollama have not been accurate on that front.
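
For what it's worth, that symptom (plausible-looking coordinates, clicks landing far from the element) is what you'd get if the model's coordinates are expressed in the pixel space of a resized copy of the screenshot rather than the original. A sketch of the correction under that assumption; `model_size`, the dimensions of the image as the model actually saw it, is a placeholder you'd have to determine for your runtime, and the example numbers are hypothetical:

```python
# Hypothetical correction, assuming (unverified) that the model's bbox is
# in the pixel space of a resized copy of the screenshot. model_size is
# a placeholder for the post-preprocessing dimensions, which you'd have
# to determine for your runtime.
import pyautogui

def click_bbox_center(bbox, model_size, screen_size):
    x1, y1, x2, y2 = bbox
    sx = screen_size[0] / model_size[0]  # horizontal scale back to screen space
    sy = screen_size[1] / model_size[1]  # vertical scale back to screen space
    cx = int((x1 + x2) / 2 * sx)
    cy = int((y1 + y2) / 2 * sy)
    pyautogui.click(cx, cy)

# If a 1920x1080 screenshot were resized to, say, 1288x728 (multiples of
# 28, as qwen2.5vl-style preprocessing uses) before the model saw it,
# raw coordinates would land roughly 1.5x short of the real element:
click_bbox_center([100, 200, 300, 400], model_size=(1288, 728), screen_size=(1920, 1080))
```

If the runtime resizes internally without exposing the resulting dimensions, this mapping can't be done client-side, which would be consistent with the gap versus `transformers` described above.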


@rick-github commented on GitHub (Oct 4, 2025):

https://github.com/ggml-org/llama.cpp/issues/13694


Reference: github-starred/ollama#70195