[GH-ISSUE #7080] Support for NVLM #66552

Open
opened 2026-05-04 07:23:47 -05:00 by GiteaMirror · 7 comments

Originally created by @mitar on GitHub (Oct 2, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7080

NVLM is a model from Nvidia: https://nvlm-project.github.io/

GiteaMirror added the model label 2026-05-04 07:23:47 -05:00

@duaneking commented on GitHub (Oct 2, 2024):

Yeah, looking at the numbers, it's a little behind the new Llama 3 stuff, but as they release updates, support for this "open" decoder-only model would be nice. I just worry that it's not as open as it seems. The hardware needed to load it is also an issue, at 72 GB.


@TKaluza commented on GitHub (Oct 4, 2024):

So it would run on an A100/H100?


@nonetrix commented on GitHub (Oct 5, 2024):

I'm able to run Qwen 2.5 72B at 4 bits with 16 GB of VRAM and 64 GB of RAM, adding up to 80 GB, but I think you are getting parameters and gigabytes confused. The model is presumably trained at FP16, and the precision can be lowered with quantization to FP8, FP4, and so on to fit in smaller amounts of memory, sacrificing some accuracy.
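
For a rough sense of the numbers, a minimal weights-only sketch (the usual params × bits ÷ 8 estimate; KV cache, activations, and runtime overhead not included, so real usage is higher):

```python
# Back-of-the-envelope weight footprint for a 72B-parameter model
# at different quantization levels. Weights only.
PARAMS = 72e9  # 72 billion parameters

for name, bits in [("FP16", 16), ("FP8", 8), ("4-bit", 4)]:
    gib = PARAMS * bits / 8 / 1024**3
    print(f"{name}: ~{gib:.0f} GiB for weights alone")

# FP16: ~134 GiB, FP8: ~67 GiB, 4-bit: ~34 GiB
```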

Also, I'm running an AMD GPU, so running an NVIDIA model is mildly funny. I just hope it's good, maybe a step above Qwen 2.5.


@angiopteris commented on GitHub (Oct 7, 2024):

Yes, it would be a great addition! Looking forward to chatting with it.


@defaultsecurity commented on GitHub (Oct 9, 2024):

Looking forward to it!


@mikeazurek commented on GitHub (Oct 9, 2024):

I don't think your request is even possible: I've never seen any multimodal AI supported by Ollama, so I deduce that Ollama is bound to only run purely language models, for some technical reason.


@nonetrix commented on GitHub (Oct 9, 2024):

No, it does support quite a few of them. This one has a weird architecture, I think, and llama.cpp seems quite slow to add them, or more likely doesn't want to because it's out of scope (it's their project, cool with me, not bashing them). I almost wish Ollama forked llama.cpp at this point, or made their own inference engine. A soft fork would probably be ideal; think Xenocara with Xorg, where it's just a patch set.
