[GH-ISSUE #2169] Inference with OpenVINO on Intel #47753

Open
opened 2026-04-28 05:10:23 -05:00 by GiteaMirror · 48 comments

Originally created by @ddpasa on GitHub (Jan 24, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2169

I think Intel CPUs/GPUs now support more efficient inference with OpenVINO. See example here with LLAVA: https://docs.openvino.ai/2023.2/notebooks/257-llava-multimodal-chatbot-with-output.html

It would be great if ollama could automatically default to OpenVINO on Intel systems.
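For context, inference through OpenVINO usually goes through the OpenVINO GenAI runtime rather than llama.cpp. A minimal sketch of what that looks like in Python, assuming the `openvino_genai` package is installed and `./llama3-ov` is a placeholder directory containing a model already converted to OpenVINO IR (both the path and the device choice are illustrative, not ollama code):

```python
# Minimal OpenVINO GenAI inference sketch (illustrative only).
# Assumes: pip install openvino-genai, and a model exported to OpenVINO IR
# (e.g. via optimum-cli) sitting in ./llama3-ov.
import openvino_genai as ov_genai

# Device can be "CPU", "GPU" (Intel iGPU/dGPU) or "NPU" where supported.
pipe = ov_genai.LLMPipeline("./llama3-ov", "GPU")

# Generate a short completion; max_new_tokens bounds the output length.
print(pipe.generate("Describe OpenVINO in one sentence.", max_new_tokens=64))
```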

GiteaMirror added the feature request label 2026-04-28 05:10:23 -05:00

@Kreijstal commented on GitHub (Jul 17, 2024):

this would be great; there is no need to default to CUDA if you don't own the hardware, it should default to your own hardware...


@Kreijstal commented on GitHub (Jul 17, 2024):

so either build for all systems and decide at runtime which to use, or ship two different packages, ollama-cuda and ollama-openvino, right?


@Kreijstal commented on GitHub (Jul 17, 2024):

I mean currently it simply installs CUDA things my hardware doesn't even support; it should at least ask whether I have CUDA!


@ddpasa commented on GitHub (Jul 18, 2024):

There will soon be a Vulkan backend for ollama. I'm not sure if OpenVINO is still needed once that starts working.


@enricorampazzo commented on GitHub (Jul 22, 2024):

> There will soon be a Vulkan backend for ollama. I'm not sure if OpenVINO is still needed once that starts working.

OpenVINO also supports the NPU, which I think would be useful: I ran a simple test last night, asking llama3-8B-Instruct "how are you"; the CPU answered in 64 minutes, the NPU did the same in 55 seconds :)


@Kreijstal commented on GitHub (Jul 22, 2024):

> > There will soon be a Vulkan backend for ollama. I'm not sure if OpenVINO is still needed once that starts working.
>
> OpenVINO also supports the NPU, which I think would be useful: I ran a simple test last night, asking llama3-8B-Instruct "how are you"; the CPU answered in 64 minutes, the NPU did the same in 55 seconds :)

h-how do you compile this on your machine?


@luhuaei commented on GitHub (Jul 22, 2024):

> > There will soon be a Vulkan backend for ollama. I'm not sure if OpenVINO is still needed once that starts working.
>
> OpenVINO also supports the NPU, which I think would be useful: I ran a simple test last night, asking llama3-8B-Instruct "how are you"; the CPU answered in 64 minutes, the NPU did the same in 55 seconds :)

How did you manage to do it? According to the [documentation](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/npu-device.html#limitations), the current OpenVINO NPU only supports models with static shapes. However, as far as I know, Llama3 is a causal (autoregressive) model, so the OpenVINO NPU should not be able to run it. Thank you for your guidance.


@enricorampazzo commented on GitHub (Jul 22, 2024):

> > > There will soon be a Vulkan backend for ollama. I'm not sure if OpenVINO is still needed once that starts working.
> >
> > OpenVINO also supports the NPU, which I think would be useful: I ran a simple test last night, asking llama3-8B-Instruct "how are you"; the CPU answered in 64 minutes, the NPU did the same in 55 seconds :)
>
> How did you manage to do it? According to the documentation, the current OpenVINO NPU only supports models with static shapes. However, as far as I know, Llama3 is a causal model, so the OpenVINO NPU should not be able to run it. Thank you for your guidance.

So, I am not an expert in any way, shape or form: I bought a laptop with an Ultra 7 processor (a 12th-gen ThinkPad X1 Carbon) on Friday, on Saturday I found out it has an NPU, and today I am running some simple benchmarks, so take what follows with a lot of salt.

I found the repo for the [Intel NPU library](https://github.com/intel/intel-npu-acceleration-library). It comes with several examples, including one that [runs Llama3 on the NPU](https://github.com/intel/intel-npu-acceleration-library/blob/main/examples/llama3.py). I had to solve/work around some issues that I have since opened (see [here](https://github.com/intel/intel-npu-acceleration-library/issues/101) and [here](https://github.com/intel/intel-npu-acceleration-library/issues/102)), but overall it works, and the difference between CPU and NPU time is quite astounding.
I know it is running on the NPU by looking at the performance tab in Task Manager, which has a separate row for the NPU.

As I said, I am running benchmarks and I will keep you posted if you are interested.
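For reference, the basic pattern that library documents is to load a regular Hugging Face model and then compile it for the NPU. A minimal sketch, assuming the `intel_npu_acceleration_library` and `transformers` packages are installed; the model id and dtype below are illustrative assumptions, not taken from the linked example:

```python
# Sketch: offload a Hugging Face causal LM to the Intel NPU via
# intel-npu-acceleration-library (illustrative, not ollama code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import intel_npu_acceleration_library

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Quantize and compile the model for the NPU; int8 keeps memory manageable.
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

inputs = tokenizer("how are you", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```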


@luhuaei commented on GitHub (Jul 23, 2024):

@Giustiniano Thank you for your explanation. I'll give it a try. Thank you.


@enricorampazzo commented on GitHub (Jul 23, 2024):

Here you can see a demonstration of llama-3 running on the NPU
https://youtu.be/p6Ohv8JXJF8


@taxmeifyoucan commented on GitHub (Jul 29, 2024):

+1 this would be very helpful


@alexma233 commented on GitHub (Aug 8, 2024):

I was wondering if there have been any updates or progress on this issue. Is there a timeline for resolving this or any workaround available? Thank you!


@divemasterjm commented on GitHub (Oct 1, 2024):

+1


@jsapede commented on GitHub (Oct 3, 2024):

+2


@johnmmcgee commented on GitHub (Oct 6, 2024):

Would be interested as well.


@liyimeng commented on GitHub (Oct 28, 2024):

+3


@ghchris2021 commented on GitHub (Nov 13, 2024):

it'd be nice to see!


@awaLiny2333 commented on GitHub (Feb 4, 2025):

+4


@thesolomon-tech commented on GitHub (Feb 13, 2025):

I think it would be good to aggregate issue notifications.
OpenVINO seems to have Intel NPU support according to their [GitHub page](https://github.com/openvinotoolkit/openvino).
The relevant issues are #5747, #8281 and #3004.


@dmbuil commented on GitHub (Feb 21, 2025):

+5!


@zhaohb commented on GitHub (Mar 24, 2025):

Hi, all
We have implemented the integration of OpenVINO and Ollama: https://github.com/zhaohb/ollama_ov or https://github.com/openvinotoolkit/openvino_contrib/pull/953


@Kreijstal commented on GitHub (Mar 24, 2025):

@zhaohb pr when


@ddpasa commented on GitHub (Mar 25, 2025):

> Hi all, we have implemented the integration of OpenVINO and Ollama: https://github.com/zhaohb/ollama_ov or openvinotoolkit/openvino_contrib#953

@zhaohb, do you plan to contribute this as a supported backend to llama.cpp?


@zhaohb commented on GitHub (Mar 25, 2025):

> > Hi all, we have implemented the integration of OpenVINO and Ollama: https://github.com/zhaohb/ollama_ov or openvinotoolkit/openvino_contrib#953
>
> @zhaohb, do you plan to contribute this as a supported backend to llama.cpp?

Hi, I guess this won't be a backend for llama.cpp; it's just a backend for Ollama.


@zhaohb commented on GitHub (Mar 26, 2025):

> @zhaohb pr when

Yes, I want to upstream it, but I don't know if it will be merged.


@FionaZZ92 commented on GitHub (Mar 26, 2025):

Ollama_OV uses OpenVINO GenAI as the backend for inference on Intel platforms, including CPU/GPU/NPU, and it ensures there is no performance gap against OpenVINO's published numbers on Intel platforms. We appreciate the user-friendly interface that Ollama provides to the open community, so the goal is to make it quick and easy for the community to use Ollama while getting the good performance that Intel tracks and maintains. An OV backend for Ollama will be helpful to both the Ollama and OpenVINO ecosystems.

We are open and delighted to upstream this to Ollama as one of the options. We look forward to anyone who can help move the work along.


@antoniomtz commented on GitHub (Apr 10, 2025):

+6


@okc0mputex commented on GitHub (Apr 24, 2025):

Is there an update on this? This will help several projects.


@FionaZZ92 commented on GitHub (Apr 24, 2025):

> Is there an update on this? This will help several projects.

As a shortcut, you can refer to this repo, where OpenVINO serves as a backend for Ollama:
https://github.com/openvinotoolkit/openvino_contrib/tree/master/modules/ollama_openvino
BTW, we are also working on reading models from GGUF files. Once that is done, I think it will be easier to integrate seamlessly with Ollama.
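Until GGUF reading lands, models generally have to be converted to OpenVINO IR before an OpenVINO-based backend can serve them. A minimal sketch of that conversion with optimum-intel, assuming the `optimum[openvino]` extra is installed; the model id and output directory are placeholders, not part of ollama_openvino:

```python
# Sketch: export a Hugging Face model to OpenVINO IR for use with an
# OpenVINO runtime (illustrative; model id and paths are placeholders).
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B-Instruct"  # assumed example model
model = OVModelForCausalLM.from_pretrained(model_id, export=True)  # convert on load
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save the converted IR so an OpenVINO runtime can load it directly.
model.save_pretrained("qwen2.5-1.5b-ov")
tokenizer.save_pretrained("qwen2.5-1.5b-ov")
```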


@brownplayer commented on GitHub (May 1, 2025):

> > Is there an update on this? This will help several projects.
>
> As a shortcut, you can refer to this repo, where OpenVINO serves as a backend for Ollama: https://github.com/openvinotoolkit/openvino_contrib/tree/master/modules/ollama_openvino By the way, we are also working on reading models from GGUF files. Once that is done, I think it will be easier to integrate seamlessly with Ollama.

There was a serious problem at that time: the CPU, iGPU and NPU were all unusable.


@emaayan commented on GitHub (Jun 12, 2025):

hi, any update on this?


@FionaZZ92 commented on GitHub (Jun 12, 2025):

> hi, any update on this?

We have added new model support here; you can refer to: https://github.com/openvinotoolkit/openvino_contrib/tree/master/modules/ollama_openvino

Because the GGUF model size is not the smallest (it contains many Q6 layers), at execution time it can only fall back to Q8/FP16, so the performance is not the best. If you are after a performance boost on Intel platforms, I suggest you follow the link above. Thank you.


@ddpasa commented on GitHub (Jun 12, 2025):

llama.cpp with the Vulkan backend works very well with Intel GPUs; you can find benchmarks here: https://github.com/ggml-org/llama.cpp/discussions/10879


@emaayan commented on GitHub (Jun 12, 2025):

> llama.cpp with the Vulkan backend works very well with Intel GPUs; you can find benchmarks here: ggml-org/llama.cpp#10879

I don't exactly know what a Vulkan backend is, but I have a Lenovo T14 Gen 5 with no GPU, just an Intel Core Ultra 7 165U and an NPU.
I'm looking specifically for ollama so I could use the JetBrains integration with it.

> > hi, any update on this?
>
> We have added new model support here; you can refer to: openvinotoolkit/openvino_contrib@master/modules/ollama_openvino
>
> Because the GGUF model size is not the smallest (it contains many Q6 layers), at execution time it can only fall back to Q8/FP16, so the performance is not the best. If you are after a performance boost on Intel platforms, I suggest you follow the link above. Thank you.


@emaayan commented on GitHub (Jun 12, 2025):

> > hi, any update on this?
>
> We have added new model support here; you can refer to: openvinotoolkit/openvino_contrib@master/modules/ollama_openvino
>
> Because the GGUF model size is not the smallest (it contains many Q6 layers), at execution time it can only fall back to Q8/FP16, so the performance is not the best. If you are after a performance boost on Intel platforms, I suggest you follow the link above. Thank you.

Hi, I actually tried using your stuff, but I have 2 issues. The first is that trying to run your exe (which I feel rather uncomfortable downloading from a Google Drive) failed with this message (and I did download the environment).

And when I tried to compile it, it also failed with compilation errors on includes.

![Image](https://github.com/user-attachments/assets/4e7fdab8-7aa8-49e3-bdca-3fe732f61da5)


@ddpasa commented on GitHub (Jun 12, 2025):

> > llama.cpp with the Vulkan backend works very well with Intel GPUs; you can find benchmarks here: ggml-org/llama.cpp#10879
>
> I don't exactly know what a Vulkan backend is, but I have a Lenovo T14 Gen 5 with no GPU, just an Intel Core Ultra 7 165U and an NPU. I'm looking specifically for ollama so I could use the JetBrains integration with it.

The Intel 165U has an iGPU that supports Vulkan, so you should see a major performance gain in prompt processing and image processing from using the Vulkan backend. My much older Iris G7 gets approximately a 2x speedup; you will likely get more.
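For anyone trying the Vulkan route, llama.cpp's bundled `llama-server` exposes an OpenAI-compatible HTTP API regardless of which backend it was built with. A minimal sketch of querying it from Python, assuming a server is already running and listening on `http://localhost:8080` (the port, model name and timeout below are assumptions):

```python
# Sketch: query a local llama.cpp llama-server via its OpenAI-compatible API.
# Assumes the server is already running (e.g. a Vulkan build) on localhost:8080.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",  # not meaningful for a single-model server
        "messages": [{"role": "user", "content": "how are you"}],
        "max_tokens": 64,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```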


@emaayan commented on GitHub (Jun 12, 2025):

> > I don't exactly know what a Vulkan backend is, but I have a Lenovo T14 Gen 5 with no GPU, just an Intel Core Ultra 7 165U and an NPU. I'm looking specifically for ollama so I could use the JetBrains integration with it.
>
> The Intel 165U has an iGPU that supports Vulkan, so you should see a major performance gain in prompt processing and image processing from using the Vulkan backend. My much older Iris G7 gets approximately a 2x speedup; you will likely get more.

So this leaves the question of whether it can be used as an Ollama server, because JetBrains integrates either with ollama or LM Studio.


@FionaZZ92 commented on GitHub (Jun 12, 2025):

> Hi, I actually tried using your stuff, but I have 2 issues. The first is that trying to run your exe (which I feel rather uncomfortable downloading from a Google Drive) failed with this message (and I did download the environment). And when I tried to compile it, it also failed with compilation errors on includes.

To run it successfully, you should follow the steps in the repo above and download the OpenVINO GenAI package as well. If you have any usage questions about Ollama-OV, feel free to ask there and we will help answer: https://github.com/openvinotoolkit/openvino_contrib/issues


@ghchris2021 commented on GitHub (Jun 21, 2025):

> llama.cpp with the Vulkan backend works very well with Intel GPUs; you can find benchmarks here: ggml-org/llama.cpp#10879

I have recently started to revisit which inference and UI / local OpenAI API server options work for me on an Intel GPU.

A week or so ago I tried ggml.org's own llama.cpp releases with their different builds / back ends, SYCL and Vulkan, and as far as I recall the SYCL option often delivered more performance on the same GGUF model files, with Vulkan noticeably slower in some cases.
The SYCL build was a few weeks older than the Vulkan build at that time, but there are now newer ones from yesterday for both, which I have not tried.

I also tried the most recently available ipex-llm inference software from Intel, which IIRC uses a heavily modified llama.cpp (or maybe I'm wrong) and somehow uses Intel IPEX as part of the inference process. When I tested it, I found it very significantly faster than either the ggml.org llama.cpp SYCL or Vulkan builds, almost 2x faster in one relevant case that I somewhat recall.

I have not since had time to benchmark ollama, Hugging Face transformers, or ONNX for inference of the same models, but those are also possible inference options which might have particular advantages and disadvantages for some use cases.

@emaayan commented on GitHub (Jun 22, 2025):

> I also tried the most recently available ipex-llm inference software from Intel, which IIRC uses a heavily modified llama.cpp (or maybe I'm wrong) and somehow uses Intel IPEX as part of the inference process. When I tested it, I found it very significantly faster than either the ggml.org llama.cpp SYCL or Vulkan builds, almost 2x faster in one relevant case that I somewhat recall.

Are you talking about this thing? https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quickstart/ollama_quickstart.md


@ghchris2021 commented on GitHub (Jun 22, 2025):

> Are you talking about this thing? https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quickstart/ollama_quickstart.md

Yes, sort of. I didn't specifically use the ollama tie-in, just the basic ipex-llm release, which has llama.cpp-like tool binaries for benchmarking, serving, etc. So I ran some benchmarks to compare it against the SYCL and Vulkan builds of ggml.org's llama.cpp benchmark tool.

https://github.com/intel/ipex-llm/releases

https://github.com/ipex-llm/ipex-llm/releases/tag/v2.3.0-nightly

The ipex-llm "nightly build" doesn't actually seem to have been published recently at those links, but maybe there are newer builds somewhere; otherwise I suspect there will eventually be new versions. Anyway, it was one more thing to test.

Compared to the ggml.org llama.cpp release builds, IIRC I found ipex-llm to be noticeably faster in one interesting benchmark use case I tried. I suspect that would also translate to actual "server" and "ollama" performance when they are based on the same underlying inference code version.

For points of comparison, there are the 'full-intel' and 'full-vulkan' ggml.org llama.cpp builds / versions such as these, which are fairly recent (as of the past day):

https://github.com/ggml-org/llama.cpp/pkgs/container/llama.cpp/versions

https://github.com/ggml-org/llama.cpp/pkgs/container/llama.cpp/443832581?tag=full-intel-b5732

https://github.com/ggml-org/llama.cpp/pkgs/container/llama.cpp/443831174?tag=full-vulkan-b5732

And when I get around to it, I'll compare with ollama, Hugging Face transformers in various configurations, and the newest OpenVINO.
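For what it's worth, the ipex-llm path also has a Python API rather than being only a drop-in ollama replacement. A minimal sketch of its documented transformers-style usage, assuming an `ipex-llm[xpu]` install and an Intel GPU visible as the `xpu` device; the model id is a placeholder:

```python
# Sketch: 4-bit inference on an Intel GPU ("xpu") with ipex-llm
# (illustrative; not the ollama integration itself).
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"  # assumed example model
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
model = model.to("xpu")  # move the quantized model onto the Intel GPU
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("how are you", return_tensors="pt").to("xpu")
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```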


@emaayan commented on GitHub (Jun 22, 2025):

I tried the 2.3.0 ollama build with Qwen2.5 Coder Instruct on my Intel Core Ultra 165U and it didn't seem to be working any faster when I asked it to analyze a Java class.


@kamikaze commented on GitHub (Jan 14, 2026):

Looks like an already abandoned project


@ddpasa commented on GitHub (Jan 14, 2026):

> Looks like an already abandoned project

Just use llama.cpp; they already support Intel GPUs.


@kamikaze commented on GitHub (Jan 14, 2026):

> > Looks like an already abandoned project
>
> Just use llama.cpp; they already support Intel GPUs.

but not the NPU


@ddpasa commented on GitHub (Jan 15, 2026):

> > > Looks like an already abandoned project
> >
> > Just use llama.cpp; they already support Intel GPUs.
>
> but not the NPU

Open a ticket with the llama.cpp folks. They are great at this kind of stuff.


@rklec commented on GitHub (Jan 27, 2026):

See https://github.com/ggml-org/llama.cpp/issues/5079 and https://github.com/ggml-org/llama.cpp/issues/9181; it's apparently a recurring request.

There is apparently also a repo by the Intel people: https://github.com/intel/ipex-llm


@jclab-joseph commented on GitHub (Mar 30, 2026):

I hope OpenVINO gets integrated into Ollama.
Maybe the llama.cpp work can serve as a reference: https://github.com/ggml-org/llama.cpp/pull/15307

Reference: github-starred/ollama#47753