[GH-ISSUE #13594] grounding coordinates #34711

Open
opened 2026-04-22 18:29:06 -05:00 by GiteaMirror · 3 comments

Originally created by @Abdulrahman392011 on GitHub (Jan 1, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/13594

Is there a way to use grounding with vision models other than just prompt engineering? I tried using structured output, but it didn't work properly with moondream:1.8b.
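
For reference, a structured-output request of the kind described above looks roughly like this. This is a minimal sketch using the `ollama` Python client; the JSON schema and field names are purely illustrative, not a format moondream was trained to emit:

```python
# Sketch: ask for bounding boxes via Ollama's structured outputs
# (the "format" parameter accepts a JSON schema). The schema below
# is an assumption for illustration, not a moondream-native format.
import ollama

schema = {
    "type": "object",
    "properties": {
        "objects": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "label": {"type": "string"},
                    # Assumed convention: [x_min, y_min, x_max, y_max].
                    "box": {
                        "type": "array",
                        "items": {"type": "number"},
                        "minItems": 4,
                        "maxItems": 4,
                    },
                },
                "required": ["label", "box"],
            },
        }
    },
    "required": ["objects"],
}

response = ollama.chat(
    model="moondream:1.8b",
    messages=[{
        "role": "user",
        "content": "Detect every cat and give its bounding box.",
        "images": ["photo.jpg"],  # hypothetical local image path
    }],
    format=schema,
)
print(response["message"]["content"])  # JSON string matching the schema
```

Note that structured outputs only constrain the *shape* of the reply; if the model never learned a coordinate format, the numbers it fills in can still be wrong, which matches the "didn't work properly" behaviour described above.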

I know that some models are good at detecting trigger words in the prompt and returning coordinates, but it's not a reliable method; it's kind of hit or miss.

I'm interested in the smaller models; the big ones have no issues detecting the trigger words. It's usually the smaller ones that give me a hard time.

GiteaMirror added the feature request label 2026-04-22 18:29:06 -05:00

@rick-github commented on GitHub (Jan 2, 2026):

This is model dependent. For moondream, the model author suggests prompting with [`Bounding box: {object}`](https://github.com/vikhyat/moondream/issues/101)
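
For concreteness, here is a minimal sketch of that prompt format against the Ollama API, using the `ollama` Python client. The coordinate parsing is an assumption: moondream tends to answer with normalized coordinates in free text, so the regex below is illustrative rather than a guaranteed format:

```python
# Sketch: moondream's suggested grounding prompt, "Bounding box: {object}".
import re
import ollama

obj = "dog"  # hypothetical target object
response = ollama.chat(
    model="moondream:1.8b",
    messages=[{
        "role": "user",
        "content": f"Bounding box: {obj}",
        "images": ["photo.jpg"],  # hypothetical local image path
    }],
)
text = response["message"]["content"]
# Pull any numbers out of a reply like "(0.12, 0.30, 0.55, 0.81)".
coords = [float(x) for x in re.findall(r"\d+(?:\.\d+)?", text)]
print(coords)
```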


@Abdulrahman392011 commented on GitHub (Jan 2, 2026):

Wouldn't it be a good idea to have the Ollama API deal with the discrepancies and needs of each model, so that users have a somewhat unified way of using the models through the API?

On a separate note, why are there no small vision models in Ollama? Florence-2, for instance, is 0.2B for the base and 0.8B for the large variant and works well on edge devices; moondream at 1.8B is the smallest one I can find on Ollama.


@SingularityMan commented on GitHub (Jan 20, 2026):

> Wouldn't it be a good idea to have the Ollama API deal with the discrepancies and needs of each model, so that users have a somewhat unified way of using the models through the API?
>
> On a separate note, why are there no small vision models in Ollama? Florence-2, for instance, is 0.2B for the base and 0.8B for the large variant and works well on edge devices; moondream at 1.8B is the smallest one I can find on Ollama.

You can try qwen3vl-2b, but you need to scale each image to 1000x1000 because that's the image resolution it was trained for. Every model is different, and Ollama won't be able to accommodate all of them.
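
A minimal sketch of that pre-scaling step, assuming the library tag is `qwen3vl:2b` (check the Ollama library for the exact tag) and using Pillow for the resize:

```python
# Sketch: resize to 1000x1000 before sending to the model, since that
# is the resolution the comment above says the model was trained for.
import io
import ollama
from PIL import Image

img = Image.open("photo.jpg").convert("RGB")  # hypothetical input image
orig_w, orig_h = img.size
buf = io.BytesIO()
img.resize((1000, 1000)).save(buf, format="JPEG")

response = ollama.chat(
    model="qwen3vl:2b",  # assumed tag; verify in the model library
    messages=[{
        "role": "user",
        "content": "Give the bounding box of the dog as x1,y1,x2,y2.",
        "images": [buf.getvalue()],  # the client accepts raw bytes
    }],
)
print(response["message"]["content"])
# Any pixel coordinates come back in the 1000x1000 space and need to be
# rescaled by (orig_w / 1000, orig_h / 1000) for the original image.
```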


Reference: github-starred/ollama#34711