[GH-ISSUE #6502] ONNX backend runtime support to simplify HW support? #4094

Open
opened 2026-04-12 14:59:48 -05:00 by GiteaMirror · 5 comments

Originally created by @TheSpaceGod on GitHub (Aug 25, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6502

A recurring issue I see come up over and over again is people not being able to run models on their hardware for any number of reasons, one of the biggest being that llama.cpp has not incorporated the proper level of support for their chosen hardware manufacturer or hardware SDK. If I understand the goal of Ollama correctly, to be a nice, easy way of running LLMs with an API layer over lower-level LLM libraries, would it make sense to seriously consider moving to a single-solution runtime that can support almost all hardware configurations (i.e. the ONNX runtime)?

These are some of the most commented PRs currently, and there will probably be more to come from increasingly small/niche HW vendors (examples: Ascend NPU or Google Coral TPU):

  • https://github.com/ollama/ollama/pull/5593
  • https://github.com/ollama/ollama/pull/5059
  • https://github.com/ollama/ollama/pull/2458
  • https://github.com/ollama/ollama/pull/5426
  • https://github.com/ollama/ollama/pull/5872

Would it make more sense to offload this hard work to the ONNX Runtime folks and keep Ollama focused on what it is good at: being a great API for LLM inference? I realize this would require a massive amount of work to implement yet another backend, but I am opening this ticket mainly as a callout for a trend I see in some of the Ollama tickets.

https://onnxruntime.ai/docs/execution-providers/
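For context, the execution providers behind that link are pluggable hardware backends selected when an inference session is created. A minimal sketch of discovering them with ONNX Runtime's Python API (the output depends entirely on how the installed onnxruntime package was built):

```python
import onnxruntime as ort

# Lists the execution providers compiled into this onnxruntime build,
# e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider'].
print(ort.get_available_providers())
```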

GiteaMirror added the feature request label 2026-04-12 14:59:48 -05:00

@thewh1teagle commented on GitHub (Aug 28, 2024):

onnxruntime doesn't have Vulkan, which simplifies everything on Windows and Linux.


@TheSpaceGod commented on GitHub (Aug 28, 2024):

> onnxruntime doesn't have Vulkan, which simplifies everything on Windows and Linux.

Correct, but it does support:

  • Nvidia CUDA/TensorRT
  • AMD ROCm
  • Intel OpenVINO

That covers all 3 big GPU makers and is more than Ollama currently supports. Vulkan support would be nice, though.
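
To illustrate how those three backends plug into one runtime, here is a hedged sketch of session creation (the model path is a placeholder and the preference order is an assumption; requesting a provider the build lacks raises an error, hence the filter):

```python
import onnxruntime as ort

# Ordered preference: earlier providers win; CPU is the universal fallback.
preferred = [
    "TensorrtExecutionProvider",  # Nvidia TensorRT
    "CUDAExecutionProvider",      # Nvidia CUDA
    "ROCMExecutionProvider",      # AMD ROCm
    "OpenVINOExecutionProvider",  # Intel OpenVINO
    "CPUExecutionProvider",       # always present
]
providers = [p for p in preferred if p in ort.get_available_providers()]

# "model.onnx" stands in for a model already converted to ONNX format.
session = ort.InferenceSession("model.onnx", providers=providers)
print(session.get_providers())  # the providers actually in use
```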


@thewh1teagle commented on GitHub (Sep 1, 2024):

> • Intel OpenVINO

Do you know if OpenVINO requires special care with the model files or the ggml/gguf files, or does Ollama work seamlessly with it?
By the way, onnxruntime also doesn't work with the ggml format.
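
For what it's worth, ONNX Runtime consumes models in the ONNX format rather than ggml/gguf, so a conversion step would be unavoidable. One common route is a Hugging Face Optimum export, sketched here under the assumption that optimum[onnxruntime] is installed (the model id is only an example):

```python
# pip install "optimum[onnxruntime]"
from optimum.onnxruntime import ORTModelForCausalLM

# export=True converts the original PyTorch checkpoint into an ONNX graph.
model = ORTModelForCausalLM.from_pretrained("gpt2", export=True)

# Writes model.onnx and its config to the target directory.
model.save_pretrained("./gpt2-onnx")
```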


@richard087 commented on GitHub (Dec 3, 2024):

> > • Intel OpenVINO
> >
> > Do you know if OpenVINO requires special care with the model files or the ggml/gguf files, or does Ollama work seamlessly with it? By the way, onnxruntime also doesn't work with the ggml format.

OpenVINO uses its own format, in competition with ggml/gguf. It will (relatively quickly) quantise a model, and it is the quantised model that gets used for inference. There is/was a bug with this process, but it's likely solved now/soon.

Personally, I've been pretty unimpressed by the ONNX ecosystem, and found it's worked better on paper than in reality.
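
To make the format point concrete, here is a hedged sketch of OpenVINO's conversion step (paths are placeholders; assumes the openvino Python package, 2023.x or newer):

```python
import openvino as ov

# Convert an ONNX (or other supported) model into OpenVINO's own IR format.
ov_model = ov.convert_model("model.onnx")

# Writes model.xml + model.bin; by default weights are compressed to FP16.
ov.save_model(ov_model, "model.xml")
```

Deeper quantisation (e.g. INT8) goes through Intel's separate NNCF tooling rather than this call.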


@thewh1teagle commented on GitHub (Dec 3, 2024):

> Personally, I've been pretty unimpressed by the ONNX ecosystem, and found it's worked better on paper than in reality.

I’ve had cases where the runtime seemed to support features but didn’t work well. For example, DirectML is often slow even though it’s supposed to use the GPU, and it’s missing some operators. But it’s still being developed, backed by Microsoft, and has a lot of potential. Overall, it’s a good experience because it has precompiled libraries and is stable, unlike ggml, which can sometimes crash the whole program.
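
One way to make that kind of silent degradation visible, sketched with a placeholder model path: at verbose log levels, ONNX Runtime reports which execution provider each graph node was assigned to, so operators that fell back from DirectML to CPU show up in the log.

```python
import onnxruntime as ort

# Severity 0 = verbose; the log then includes per-node provider placement,
# which exposes operators that silently fell back to CPU.
opts = ort.SessionOptions()
opts.log_severity_level = 0

# DmlExecutionProvider requires a DirectML-enabled build
# (the onnxruntime-directml package on Windows).
session = ort.InferenceSession(
    "model.onnx",  # placeholder path
    sess_options=opts,
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)
```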

Reference: github-starred/ollama#4094