[GH-ISSUE #3113] Integrated Intel GPU support #27673

Open
opened 2026-04-22 05:11:54 -05:00 by GiteaMirror · 33 comments

Originally created by @clvgt12 on GitHub (Mar 13, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3113

Originally assigned to: @dhiltgen on GitHub.

Hello,

Please consider adapting Ollama to use Intel Integrated Graphics Processors (such as the Intel Iris Xe Graphics cores) in the future.

GiteaMirror added the intel and feature request labels 2026-04-22 05:11:54 -05:00

@ddpasa commented on GitHub (Mar 13, 2024):

Take a look at: https://github.com/ollama/ollama/pull/2578


@clvgt12 commented on GitHub (Mar 13, 2024):

Very nice! Looking forward to testing it on my Windows PC running Ollama in the future!


@vinaykharayat commented on GitHub (Mar 27, 2024):

+1


@MarkWard0110 commented on GitHub (Apr 19, 2024):

For anyone who has an Intel integrated GPU, the otherwise idle GPU would add an additional device to utilize. Even if it were limited to 3 GB, that would be an extra 3 GB of GPU memory; today that 3 GB goes unused when a model is split between an Nvidia GPU and the CPU.
I am running a headless server, and the integrated GPU is sitting there doing nothing to help.


@carlos-burelo commented on GitHub (Apr 27, 2024):

+1


@sspanogle commented on GitHub (Jun 3, 2024):

+1


@alexb7373 commented on GitHub (Jun 5, 2024):

+1


@suncloudsmoon commented on GitHub (Jun 7, 2024):

+1


@serhatsatir commented on GitHub (Jun 10, 2024):

I also have an Intel Iris Xe with 8 GB of RAM, but I can't see any benefit. It would be very useful if the hardware we have could be used to its full capacity.
Perhaps consulting an AI on how to do it could be a solution. 😂


@carlos-burelo commented on GitHub (Aug 1, 2024):

Personally, I have tried using [WebLLM](https://github.com/mlc-ai/web-llm) to run AI models like Llama3. When I do this, I notice an improvement in token generation speed because the Intel graphics card is being utilized via the WebGPU API. I wouldn't say the improvement is radical, but it is slightly faster; with some caution, I would estimate it at around 15%. My specifications are:

Chipset: Intel Core i7-1165G7
Graphics: Intel Iris Xe Graphics, 15.7 GB shared memory
RAM: 32 GB DDR4, 2667 MT/s

Therefore, results may vary significantly depending on the specifications of each system.


@jomardyan commented on GitHub (Aug 10, 2024):

+1


@ayttop commented on GitHub (Aug 24, 2024):

Ollama can run with an Intel iGPU via ipex-llm:

https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_quickstart.md
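
(A rough sketch of the workflow from that quickstart, for orientation only; the package name, init script, and binary names below are taken from the guide and may have changed since, so treat them as assumptions and follow the linked document.)

```bash
# Sketch of the ipex-llm llama.cpp setup from the linked quickstart (Linux).
# Commands are paraphrased from that guide and may be outdated.
pip install --pre --upgrade ipex-llm[cpp]

mkdir llama-cpp && cd llama-cpp
init-llama-cpp    # links the ipex-llm builds of the llama.cpp binaries here

# Run with all layers offloaded to the Intel GPU (-ngl 99).
export SYCL_CACHE_PERSISTENT=1
./llama-cli -m ./model.gguf -ngl 99 -p "Once upon a time"
```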


@user7z commented on GitHub (Sep 29, 2024):

> Ollama can run with an Intel iGPU via ipex-llm:
>
> https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_quickstart.md

Integrated GPU support isn't that good in current versions; see this [issue](https://github.com/intel-analytics/ipex-llm/issues/12120#issuecomment-2379372351).


@havardthom commented on GitHub (Oct 28, 2024):

For anyone interested, I've added an Ollama LXC script to tteck's Proxmox Helper-Scripts. The script installs intel-basekit, builds Ollama from source, and ~~supports Intel iGPU passthrough~~ (though it has a very long install time). It can be run like any other Proxmox helper script: `bash -c "$(wget -qLO - https://github.com/tteck/Proxmox/raw/main/ct/ollama.sh)"`

A script for Open WebUI LXC with optional Ollama install is also available: https://tteck.github.io/Proxmox/#open-webui-lxc

**Edit: NVM, Ollama does not support iGPUs because of VRAM reporting issues; need to wait for https://github.com/ollama/ollama/pull/5593**


@maczet commented on GitHub (Nov 30, 2024):

+1


@SystemStrategy commented on GitHub (Dec 2, 2024):

I ran the tteck Ollama script; it would not load some models and was way slower compared to the Docker version of IPEX.

Intel Docker IPEX:
https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/DockerGuides/docker_cpp_xpu_quickstart.md

Added commands to auto-start:
https://github.com/SystemStrategy/Proxmox/blob/main/Ipex_Compose
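
(For reference, a minimal sketch of starting the IPEX-LLM XPU container described in that guide; the image tag and paths are assumptions drawn from the quickstart and may have changed, so verify against it.)

```bash
# Sketch based on the ipex-llm Docker XPU quickstart; check the linked guide
# for the current image tag and options before using.
# --device=/dev/dri passes the Intel GPU render node into the container.
docker run -itd --name ipex-llm \
  --net=host \
  --device=/dev/dri \
  -v /path/to/models:/models \
  --shm-size=16g \
  intelanalytics/ipex-llm-inference-cpp-xpu:latest
```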


@gaborkukucska commented on GitHub (Dec 7, 2024):

+1


@ChrisBGL commented on GitHub (Dec 8, 2024):

Please add native support for Intel iGPUs.


@user7z commented on GitHub (Dec 8, 2024):

I think it's not the Ollama devs' problem; it's an Intel problem that they can't make their oneAPI usable by the community. And the obscure way ipex-llm is being developed is just insane and wouldn't make it possible for the Ollama devs to integrate it. It's Intel's problem.


@MaoJianwei commented on GitHub (Jul 15, 2025):

Can Ollama use an Intel integrated GPU to speed up inference? E.g., the Intel UHD Graphics 630 of an i5-10400.


@ddpasa commented on GitHub (Jul 15, 2025):

> Can Ollama use an Intel integrated GPU to speed up inference? E.g., the Intel UHD Graphics 630 of an i5-10400.

I'm using an Intel iGPU with llama.cpp via Vulkan. Ollama is just a wrapper around llama.cpp, but for reasons unknown the devs refuse to enable Vulkan. If you have an Intel iGPU, my recommendation is to use llama.cpp directly.
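
(For anyone who wants to follow that advice, a minimal sketch of building llama.cpp with its Vulkan backend; the `GGML_VULKAN` CMake option and `llama-cli` binary name match recent llama.cpp versions, but check the project's build docs if they don't work for you.)

```bash
# Build llama.cpp with the Vulkan backend (needs Vulkan drivers and SDK headers).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# Offload as many layers as possible to the iGPU (-ngl 99).
./build/bin/llama-cli -m ./model.gguf -ngl 99 -p "Hello"
```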


@NeoZhangJianyu commented on GitHub (Jul 16, 2025):

This issue has a solution in https://github.com/ollama/ollama/issues/8414.


@MaoJianwei commented on GitHub (Jul 21, 2025):

> > Can Ollama use an Intel integrated GPU to speed up inference? E.g., the Intel UHD Graphics 630 of an i5-10400.
>
> I'm using an Intel iGPU with llama.cpp via Vulkan. Ollama is just a wrapper around llama.cpp, but for reasons unknown the devs refuse to enable Vulkan. If you have an Intel iGPU, my recommendation is to use llama.cpp directly.

@ddpasa Many thanks. I tried running Ollama with the Iris Xe (iGPU) of an Intel i7, but I found the inference speed of the iGPU is close to that of the CPU **(about 18 tokens/s versus 19 tokens/s)**.

So I think it makes no sense to attempt to run Ollama with an iGPU.


@MaoJianwei commented on GitHub (Jul 21, 2025):

> This issue has a solution in [#8414](https://github.com/ollama/ollama/issues/8414).

No, #8414 doesn't support 10th-gen Intel CPUs. @NeoZhangJianyu


@Gunnarr970 commented on GitHub (Jul 21, 2025):

> So I think it makes no sense to attempt to run Ollama with an iGPU.

Even if the speed is the same, CPU resources will remain available for other processes if the GPU is used.


@NeoZhangJianyu commented on GitHub (Jul 22, 2025):

> > This issue has a solution in [#8414](https://github.com/ollama/ollama/issues/8414).
>
> No, #8414 doesn't support 10th-gen Intel CPUs. @NeoZhangJianyu

The iGPU in 10th-gen CPUs isn't supported by oneAPI (SYCL). That's the root cause of why the llama.cpp SYCL backend can't support it.

Refer to [hardware](https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/SYCL.md#hardware).
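
(A quick way to check whether oneAPI can see a given iGPU at all: `sycl-ls` ships with the oneAPI Base Toolkit. The `setvars.sh` path below is the default install location, so adjust it if yours differs.)

```bash
# List the devices visible to the oneAPI/SYCL runtime. If the iGPU does not
# appear as a GPU device here, the llama.cpp SYCL backend cannot use it.
source /opt/intel/oneapi/setvars.sh   # default oneAPI install path
sycl-ls
```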


@MaoJianwei commented on GitHub (Jul 22, 2025):

> > So I think it makes no sense to attempt to run Ollama with an iGPU.
>
> Even if the speed is the same, CPU resources will remain available for other processes if the GPU is used.

Yes, but my purpose in using the iGPU is to speed up inference, and that expectation is not satisfied.


@MaoJianwei commented on GitHub (Jul 22, 2025):

> The iGPU in 10th-gen CPUs isn't supported by oneAPI (SYCL). That's the root cause of why the llama.cpp SYCL backend can't support it.
>
> Refer to [hardware](https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/SYCL.md#hardware).

Thanks, I see. I bought my computer one year too early. What a pity.


@ddpasa commented on GitHub (Jul 22, 2025):

> > > So I think it makes no sense to attempt to run Ollama with an iGPU.
> >
> > Even if the speed is the same, CPU resources will remain available for other processes if the GPU is used.
>
> Yes, but my purpose in using the iGPU is to speed up inference, and that expectation is not satisfied.

Token generation is limited by memory bandwidth, so you'll see very similar speeds on CPU or iGPU. The iGPU helps with input token processing and image processing. I'm getting 2x to 3x speedups on input token processing and VLMs on an Intel 10th-gen iGPU when using llama.cpp with Vulkan.
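
(To see this prefill/decode split on your own hardware, llama.cpp's bundled `llama-bench` reports prompt processing (pp) and token generation (tg) separately; a minimal invocation, assuming a Vulkan build as sketched earlier in this thread:)

```bash
# Benchmark prefill (-p, prompt tokens) and decode (-n, generated tokens)
# separately; on an iGPU, pp t/s typically improves far more than tg t/s.
./build/bin/llama-bench -m ./model.gguf -p 512 -n 128 -ngl 99
```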


@MaoJianwei commented on GitHub (Jul 22, 2025):

> Token generation is limited by memory bandwidth, so you'll see very similar speeds on CPU or iGPU. The iGPU helps with input token processing and image processing. I'm getting 2x to 3x speedups on input token processing and VLMs on an Intel 10th-gen iGPU when using llama.cpp with Vulkan.

Do you mean the iGPU can speed up the prefill phase but not the decode phase? @ddpasa


@MaoJianwei commented on GitHub (Aug 10, 2025):

I found the solution! That's crazy! @NeoZhangJianyu

https://github.com/ggml-org/llama.cpp/issues/1956


@vsenn commented on GitHub (Feb 6, 2026):

+1. It would be great to have iGPU support for Intel hardware, as I have devices like Intel UHD Graphics @ 1.10 GHz [Integrated] and Intel Comet Lake UHD Graphics @ 1.15 GHz [Integrated] with 65 GB RAM/VRAM in my Intel NUC machines running a Kubernetes cluster.


@MaoJianwei commented on GitHub (Feb 6, 2026):

I found the solution! https://github.com/ggml-org/llama.cpp/issues/1956#issuecomment-3172333543
@vsenn


Reference: github-starred/ollama#27673